Preface


Causality is a fascinating topic of research. Its mathematization has only relatively 
recently started, and many conceptual problems are still being debated — often 
with considerable intensity. 

While this book summarizes the results of spending a decade assaying causality, 
others have studied this problem much longer than we have, and there already exist 
books about causality, including the comprehensive treatments of Pearl [2009], 
Spirtes et al. [2000], and Imbens and Rubin [2015]. We hope that our book is able 
to complement existing work in two ways. 

First, the present book represents a bias toward a subproblem of causality that 
may be considered both the most fundamental and the least realistic. This is the 
cause-effect problem, where the system under analysis contains only two observ- 
ables. We have studied this problem in some detail during the last decade. We 
report much of this work, and try to embed it into a larger context of what we con- 
sider fundamental for gaining a selective but profound understanding of the issues 
of causality. Although it might be instructive to study the bivariate case first, fol- 
lowing the sequential chapter order, it is also possible to directly start reading the 
multivariate chapters; see Figure I. 

And second, our treatment is motivated and influenced by the fields of machine 
learning and computational statistics. We are interested in how methods thereof 
can help with the inference of causal structures, and even more so whether causal 
reasoning can inform the way we should be doing machine learning. Indeed, we 
feel that some of the most profound open issues of machine learning are best under- 
stood if we do not take a random experiment described by a probability distribution 
as our starting point, but instead we consider causal structures underlying the dis- 
tribution. 

We try to provide a systematic introduction into the topic that is accessible to
readers familiar with the basics of probability theory and statistics or machine
learning (for completeness, the most important concepts are summarized in
Appendices A.1 and A.2).

While we build on the graphical approach to causality as represented by the work 
of Pearl [2009] and Spirtes et al. [2000], our personal taste influenced the choice 
of topics. To keep the book accessible and focus on the conceptual issues, we were 
forced to devote regrettably little space to a number of significant issues in causal- 
ity, be it advanced theoretical insights for particular settings or various methods of 
practical importance. We have tried to include references to the literature for some 
of the most glaring omissions, but we may have missed important topics. 

Our book has a number of shortcomings. Some of them are inherited from the 
field, such as the tendency that theoretical results are often restricted to the case 
where we have infinite amounts of data. Although we do provide algorithms and 
methodology for the finite data case, we do not discuss statistical properties of such 
methods. Additionally, at some places we neglect measure theoretic issues, often 
by assuming the existence of densities. We find all of these questions both relevant 
and interesting but made these choices to keep the book short and accessible to a 
broad audience. 

Another disclaimer is in order. Computational causality methods are still in their 
infancy, and in particular, learning causal structures from data is only doable in 
rather limited situations. We have tried to include concrete algorithms wherever 
possible, but we are acutely aware that many of the problems of causal inference 
are harder than typical machine learning problems, and we thus make no promises 
as to whether the algorithms will work on the reader’s problems. Please do not feel 
discouraged by this remark — causal learning is a fascinating topic and we hope 
that reading this book may convince you to start working on it. 

We would not have been able to finish this book without the support of various
people. 

We gratefully acknowledge support for a Research in Pairs stay of the three au- 
thors at the Mathematisches Forschungsinstitut Oberwolfach, during which a sub- 
stantial part of this book was written. 

We thank Michel Besserve, Peter Bühlmann, Rune Christiansen, Frederick Eber- 
hardt, Jan Ernest, Philipp Geiger, Niels Richard Hansen, Alain Hauser, Biwei 
Huang, Marek Kaluba, Hansruedi Künsch, Steffen Lauritzen, Jan Lemeire, David 
Lopez-Paz, Marloes Maathuis, Nicolai Meinshausen, Søren Wengel Mogensen, 
Joris Mooij, Krikamol Muandet, Judea Pearl, Niklas Pfister, Thomas Richardson, 
Mateo Rojas-Carulla, Eleni Sgouritsa, Carl Johann Simon-Gabriel, Xiaohai Sun, 
Ilya Tolstikhin, Kun Zhang, and Jakob Zscheischler for many helpful comments 
and interesting discussions during the time this book was written. In particular, 
Joris and Kun were involved in much of the research that is presented here.

We thank various students at Karlsruhe Institute of Technology, Eidgenössische 
Technische Hochschule Zürich, and University of Tübingen for proofreading early
versions of this book and for asking many inspiring questions. 

Finally, we thank the anonymous reviewers and the copyediting team from West- 
chester Publishing Services for their helpful comments, and the staff from MIT 
Press, in particular Marie Lufkin Lee and Christine Bridget Savage, for providing 
kind support during the whole process. 


København and Tübingen, August 2017 

Notation and Terminology

random variable; for noise variables, we use $N$, $N_X$, $N_j$, ...
value of a random variable X 

probability measure 

probability distribution of X 


an i.i.d. sample of size n; sample index is usually i 
conditional distribution of Y given X = x 

collection of $P_{Y|X=x}$ for all x; for short: conditional of Y
given X

density (either probability mass function or probability
density function)

density of $P_X$

density of $P_X$ evaluated at the point x

(conditional) density of $P_{Y|X=x}$ evaluated at y

expectation of X 

variance of X 

covariance of X,Y 

independence between random variables X and Y 
conditional independence 

random vector of length d; dimension index is usually j 
structural causal model 

intervention distribution 

counterfactual distribution 

graph 

parents, descendants, and ancestors of node X in graph G 


Statistical and Causal Models 


Using statistical learning, we try to infer properties of the dependence among ran- 
dom variables from observational data. For instance, based on a joint sample of 
observations of two random variables, we might build a predictor that, given new 
values of only one of them, will provide a good estimate of the other one. The 
theory underlying such predictions is well developed, and — although it applies to 
simple settings — already provides profound insights into learning from data. For 
two reasons, we will describe some of these insights in the present chapter. First, 
this will help us appreciate how much harder the problems of causal inference 
are, where the underlying model is no longer a fixed joint distribution of random 
variables, but a structure that implies multiple such distributions. Second, although 
finite sample results for causal estimation are scarce, it is important to keep in mind 
that the basic statistical estimation problems do not go away when moving to the 
more complex causal setting, even if they seem small compared to the causal prob- 
lems that do not appear in purely statistical learning. Building on the preceding 
groundwork, the chapter also provides a gentle introduction to the basic notions of 
causality, using two examples, one of which is well known from machine learning. 


1.1 Probability Theory and Statistics 


Probability theory and statistics are based on the model of a random experiment or
probability space $(\Omega, \mathcal{F}, P)$. Here, $\Omega$ is a set (containing all possible
outcomes), $\mathcal{F}$ is a collection of events $A \subseteq \Omega$, and P is a measure
assigning a probability to each event. Probability theory allows us to reason about the
outcomes of random experiments, given the preceding mathematical structure. Statistical
learning, on the other hand, essentially deals with the inverse problem: We are given the
outcomes of experiments, and from this we want to infer properties of the underlying
mathematical structure. For instance, suppose that we have observed data


    $(x_1, y_1), \ldots, (x_n, y_n)$    (1.1)


where $x_i \in \mathcal{X}$ are inputs (sometimes called covariates or cases) and
$y_i \in \mathcal{Y}$ are outputs (sometimes called targets or labels). We may now assume
that each $(x_i, y_i)$, $i = 1, \ldots, n$, has been generated independently by the same
unknown random experiment. More precisely, such a model assumes that the observations
$(x_1, y_1), \ldots, (x_n, y_n)$ are realizations of random variables
$(X_1, Y_1), \ldots, (X_n, Y_n)$ that are i.i.d. (independent and identically distributed)
with joint distribution $P_{X,Y}$. Here, X and Y are random variables taking values in
metric spaces $\mathcal{X}$ and $\mathcal{Y}$.¹ Almost all of statistics and machine learning
builds on i.i.d. data. In practice, the i.i.d. assumption can be violated in various ways,
for instance if distributions shift or interventions in a system occur. As we shall see
later, some of these are intricately linked to causality.

We may now be interested in certain properties of $P_{X,Y}$, such as:


(i) the expectation of the output given the input, $f(x) = E[Y \mid X = x]$, called
regression, where often $\mathcal{Y} = \mathbb{R}$,


(ii) a binary classifier assigning each x to the class that is more likely,
$f(x) = \operatorname{argmax}_{y \in \mathcal{Y}} P(Y = y \mid X = x)$, where
$\mathcal{Y} = \{\pm 1\}$,


(iii) the density $p_{X,Y}$ of $P_{X,Y}$ (assuming it exists).


In practice, we seek to estimate these properties from finite data sets, that is, based
on the sample (1.1), or equivalently an empirical distribution $P^n_{X,Y}$ that puts a point
mass of equal weight on each observation.

This constitutes an inverse problem: We want to estimate a property of an object 
we cannot observe (the underlying distribution), based on observations that are 
obtained by applying an operation (in the present case: sampling from the unknown 
distribution) to the underlying object. 
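
To make the plug-in idea concrete, the following minimal sketch draws an i.i.d. sample
from a toy joint distribution and estimates the regression function by averaging over
the empirical distribution. The toy distribution (X uniform on {0, 1, 2}, Y equal to 2X
plus standard Gaussian noise) and the helper names `sample_pairs` and
`plug_in_regression` are assumptions made purely for illustration.

```python
import random
from collections import defaultdict


def sample_pairs(n, seed=0):
    """Draw n i.i.d. pairs (x, y) from a toy joint distribution (illustrative assumption).

    x is uniform on {0, 1, 2}; y = 2 * x + Gaussian noise with standard deviation 1.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.choice([0, 1, 2])
        y = 2.0 * x + rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    return pairs


def plug_in_regression(pairs):
    """Estimate f(x) = E[Y | X = x] by averaging y over the observations with that x."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in pairs:
        sums[x] += y
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in counts}


pairs = sample_pairs(5000)
f_hat = plug_in_regression(pairs)
for x in sorted(f_hat):
    print(f"x = {x}: estimated E[Y | X = x] = {f_hat[x]:.3f} (true value {2.0 * x:.1f})")
```

Because X takes only three values in this toy setting, each value is observed many
times and the averages are well defined; the estimate says nothing about values of x
that never occur in the sample.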


¹ A random variable X is a measurable function $\Omega \to \mathcal{X}$, where the metric
space $\mathcal{X}$ is equipped with the Borel $\sigma$-algebra. Its distribution $P_X$ on
$\mathcal{X}$ can be obtained from the measure P of the underlying probability space
$(\Omega, \mathcal{F}, P)$. We need not worry about this underlying space, and instead we
generally start directly with the distribution of the random variables, assuming the random
experiment directly provides us with values sampled from that distribution.


1.2 Learning Theory


Now suppose that just like we can obtain f from $P_{X,Y}$, we use the empirical
distribution to infer empirical estimates $f^n$. This turns out to be an ill-posed problem
[e.g., Vapnik, 1998], since for any values of x that we have not seen in the sample
$(x_1, y_1), \ldots, (x_n, y_n)$, the conditional expectation is undefined. We may, however,
define the function f on the observed sample and extend it according to any fixed
rule (e.g., setting f to +1 outside the sample or by choosing a continuous piecewise
linear f). But for any such choice, small changes in the input, that is, in the em- 
pirical distribution, can lead to large changes in the output. No matter how many 
observations we have, the empirical distribution will usually not perfectly approx- 
imate the true distribution, and small errors in this approximation can then lead 
to large errors in the estimates. This implies that without additional assumptions 
about the class of functions from which we choose our empirical estimates $f^n$, we
cannot guarantee that the estimates will approximate the optimal quantities f in a 
suitable sense. In statistical learning theory, these assumptions are formalized in 
terms of capacity measures. If we work with a function class that is so rich that 
it can fit most conceivable data sets, then it is not surprising if we can fit the data 
at hand. If, however, the function class is a priori restricted to have small capacity, 
then there are only a few data sets (out of the space of all possible data sets) that 
we can explain using a function from that class. If it turns out that nevertheless we 
can explain the data at hand, then we have reason to believe that we have found a 
regularity underlying the data. In that case, we can give probabilistic guarantees 
for the solution’s accuracy on future data sampled from the same distribution $P_{X,Y}$.

Another way to think of this is that our function class has incorporated a priori 
knowledge (such as smoothness of functions) consistent with the regularity un- 
derlying the observed data. Such knowledge can be incorporated in various ways, 
and different approaches to machine learning differ in how they handle the issue. In 
Bayesian approaches, we specify prior distributions over function classes and noise 
models. In regularization theory, we construct suitable regularizers and incorporate 
them into optimization problems to bias our solutions. 

The complexity of statistical learning arises primarily from the fact that we are 
trying to solve an inverse problem based on empirical data; if we were given
the full probabilistic model, then all these problems would go away. When we discuss
causal models, we will see that in a sense, the causal learning problem is harder 
in that it is ill-posed on two levels. In addition to the statistical ill-posed-ness, 
which is essentially because a finite sample of arbitrary size will never contain all 
information about the underlying distribution, there is an ill-posed-ness due to the 
fact that even complete knowledge of an observational distribution usually does
not determine the underlying causal model. 

Let us look at the statistical learning problem in more detail, focusing on the
case of binary pattern recognition or classification [e.g., Vapnik, 1998], where
$\mathcal{Y} = \{\pm 1\}$. We seek to learn $f : \mathcal{X} \to \mathcal{Y}$ based on
observations (1.1), generated i.i.d. from an unknown $P_{X,Y}$. Our goal is to minimize the
expected error or risk²


    $R[f] = \int \frac{1}{2} |f(x) - y| \, dP_{X,Y}(x, y)$    (1.2)


over some class of functions $\mathcal{F}$. Note that this is an integral with respect to the
measure $P_{X,Y}$; however, if $P_{X,Y}$ allows for a density $p(x, y)$ with respect to
Lebesgue measure, the integral reduces to $\int \frac{1}{2} |f(x) - y| \, p(x, y) \, dx \, dy$.

Since $P_{X,Y}$ is unknown, we cannot compute (1.2), let alone minimize it. Instead,
we appeal to an induction principle, such as empirical risk minimization. We
return the function minimizing the training error or empirical risk


    $R^n_{\mathrm{emp}}[f] = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} |f(x_i) - y_i|$    (1.3)


over $f \in \mathcal{F}$. From the asymptotic point of view, it is important to ask whether
such a procedure is consistent, which essentially means that it produces a sequence
of functions whose risk converges towards the minimal possible within the given
function class $\mathcal{F}$ (in probability) as n tends to infinity. In Appendix A.3,
we show that this can only be the case if the function class is “small.” The
Vapnik-Chervonenkis (VC) dimension [Vapnik, 1998] is one possibility of measuring the
capacity or size of a function class. It also allows us to derive finite sample
guarantees, stating that with high probability, the risk (1.2) is not larger than the
empirical risk plus a term that grows with the size of the function class $\mathcal{F}$.
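
As an illustration of empirical risk minimization over a small function class, the
following sketch selects a threshold classifier by minimizing the empirical risk (1.3)
on a training sample and then evaluates the same loss on a fresh sample from the same
distribution. The data-generating process (x uniform on [0, 1], label given by the sign
of x - 0.5 and flipped with probability 0.1), the grid of candidate thresholds, and all
function names are hypothetical choices for this example.

```python
import random


def make_sample(n, seed=0, flip_prob=0.1):
    """Toy data (assumption): x uniform on [0, 1], label sign(x - 0.5), flipped w.p. flip_prob."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x > 0.5 else -1
        if rng.random() < flip_prob:
            y = -y
        data.append((x, y))
    return data


def empirical_risk(threshold, data):
    """Empirical risk (1.3) of f(x) = +1 if x > threshold else -1, with loss 0.5 * |f(x) - y|."""
    total = 0.0
    for x, y in data:
        prediction = 1 if x > threshold else -1
        total += 0.5 * abs(prediction - y)
    return total / len(data)


def erm_threshold(data, grid_size=101):
    """Return the threshold on an evenly spaced grid in [0, 1] with the smallest empirical risk."""
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    return min(grid, key=lambda t: empirical_risk(t, data))


train = make_sample(2000, seed=0)
holdout = make_sample(2000, seed=1)  # fresh sample from the same distribution
t_hat = erm_threshold(train)
print(f"selected threshold: {t_hat:.2f}")
print(f"training error:     {empirical_risk(t_hat, train):.3f}")
print(f"holdout error:      {empirical_risk(t_hat, holdout):.3f}")
```

The class of threshold functions is deliberately small, so the training error and the
error on the fresh sample can be expected to lie close together; a much richer class
could drive the training error down without a comparable gain on new data.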

Such a theory does not contradict the existing results on universal consistency,
which refers to convergence of a learning algorithm to the lowest achievable risk
with any function. There are learning algorithms that are universally consistent,
for instance nearest neighbor classifiers and Support Vector Machines [Devroye
et al., 1996, Vapnik, 1998, Schölkopf and Smola, 2002, Steinwart and Christmann,
2008]. While universal consistency essentially tells us everything can be learned in
the limit of infinite data, it does not imply that every problem is learnable well from
finite data, due to the phenomenon of slow rates. For any learning algorithm, there
exist problems for which the learning rates are arbitrarily slow [Devroye et al.,
1996]. It does tell us, however, that if we fix the distribution, and gather enough
data, then we can get arbitrarily close to the lowest risk eventually.


² This notion of risk, which does not always coincide with its colloquial use, is taken from
statistical learning theory [Vapnik, 1998] and has its roots in statistical decision theory
[Wald, 1950, Ferguson, 1967, Berger, 1985]. In that context, f(x) is thought of as an action
taken upon observing x, and the loss function measures the loss incurred when the state of
nature is y.

In practice, recent successes of machine learning systems seem to suggest that 
we are indeed sometimes already in this asymptotic regime, often with spectacular 
results. A lot of thought has gone into designing the most data-efficient methods 
to obtain the best possible results from a given data set, and a lot of effort goes 
into building large data sets that enable us to train these methods. However, in all 
these settings, it is crucial that the underlying distribution does not differ between 
training and testing, be it by interventions or other changes. As we shall argue in 
this book, describing the underlying regularity as a probability distribution, without 
additional structure, does not provide us with the right means to describe what 
might change. 


1.3 Causal Modeling and Learning 


Causal modeling starts from another, arguably more fundamental, structure. A 
causal structure entails a probability model, but it contains additional information 
not contained in the latter (see the examples in Section 1.4). Causal reasoning, 
according to the terminology used in this book, denotes the process of drawing 
conclusions from a causal model, similar to the way probability theory allows us to 
reason about the outcomes of random experiments. However, since causal models 
contain more information than probabilistic ones do, causal reasoning is more pow- 
erful than probabilistic reasoning, because causal reasoning allows us to analyze 
the effect of interventions or distribution changes. 

Just like statistical learning denotes the inverse problem to probability theory, we
can think about how to infer causal structures from their empirical implications. The
empirical implications can be purely observational, but they can also include data 
under interventions (e.g., randomized trials) or distribution changes. Researchers 
use various terms to refer to these problems, including structure learning and 
causal discovery. We refer to the closely related question of which parts of the 
causal structure can in principle be inferred from the joint distribution as struc- 
ture identifiability. Unlike the standard problems of statistical learning described 
in Section 1.2, even full knowledge of P does not make the solution trivial, and 
we need additional assumptions (see Chapters 2, 4, and 7). This difficulty should
not distract us from the fact, however, that the ill-posed-ness of the usual statistical
problems is still there (and thus it is important to worry about the capacity of
function classes also in causality, such as by using additive noise models; see
Section 4.1.4 below), only confounded by an additional difficulty arising from the
fact that we are trying to estimate a richer structure than just a probabilistic one.
We will refer to this overall problem as causal learning. Figure 1.1 summarizes
the relationships between the preceding problems and models.


[Figure 1.1 diagram. Top: causal learning leads from observations and outcomes (including
changes and interventions) to a causal model; causal reasoning leads back. Bottom:
statistical learning leads from observations and outcomes to a probabilistic model;
probabilistic reasoning leads back. The top level subsumes the bottom level.]

Figure 1.1: Terminology used by the present book for various probabilistic inference
problems (bottom) and causal inference problems (top); see Section 1.3. Note that we use
the term “inference” to include both learning and reasoning.

To learn causal structures from observational distributions, we need to understand 
how causal models and statistical models relate to each other. We will come back 
to this issue in Chapters 4 and 7 but provide an example now. A well-known topos 
holds that correlation does not imply causation; in other words, statistical proper- 
ties alone do not determine causal structures. It is less well known that one may 
postulate that while we cannot infer a concrete causal structure, we may at least infer
the existence of causal links from statistical dependences. This was first understood
by Reichenbach [1956]; we now formulate his insight (see also Figure 1.2).³


³ For clarity, we formulate some important assumptions as principles. We do not take them for
granted throughout the book; in this sense, they are not axioms.


Figure 1.2: Reichenbach’s common cause principle establishes a link between statistical
properties and causal structures. A statistical dependence between two observables X and 
Y indicates that they are caused by a variable Z, often referred to as a confounder (left). 
Here, Z may coincide with either X or Y, in which case the figure simplifies (middle/right). 
The principle further argues that X and Y are statistically independent, conditional on Z. 
In this figure, direct causation is indicated by arrows; see Chapters 3 and 6. 


Principle 1.1 (Reichenbach’s common cause principle) If two random variables
X and Y are statistically dependent ($X \not\perp\!\!\!\perp Y$), then there exists a third
variable Z that causally influences both. (As a special case, Z may coincide with either X
or Y.) Furthermore, this variable Z screens X and Y from each other in the sense
that given Z, they become independent, $X \perp\!\!\!\perp Y \mid Z$.
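

The screening-off property can be checked numerically in a simple special case. The
sketch below assumes a linear Gaussian common cause model (Z standard normal, X and Y
each equal to Z plus independent standard normal noise); this model and the helper
names are illustrative assumptions, not part of the principle itself. In this linear
Gaussian setting, a vanishing partial correlation given Z corresponds to conditional
independence given Z, so we compare the plain correlation of X and Y with the
correlation of their residuals after regressing each on Z.

```python
import random


def pearson(a, b):
    """Sample Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def residuals(target, regressor):
    """Residuals of a least-squares line fit of target on regressor."""
    n = len(target)
    mean_t = sum(target) / n
    mean_r = sum(regressor) / n
    slope = (sum((r - mean_r) * (t - mean_t) for r, t in zip(regressor, target))
             / sum((r - mean_r) ** 2 for r in regressor))
    intercept = mean_t - slope * mean_r
    return [t - (intercept + slope * r) for r, t in zip(regressor, target)]


rng = random.Random(0)
n = 20000
z = [rng.gauss(0.0, 1.0) for _ in range(n)]   # common cause Z
x = [zi + rng.gauss(0.0, 1.0) for zi in z]    # X depends on Z and its own noise
y = [zi + rng.gauss(0.0, 1.0) for zi in z]    # Y depends on Z and its own noise

corr_xy = pearson(x, y)
partial_xy_given_z = pearson(residuals(x, z), residuals(y, z))
print(f"correlation of X and Y:            {corr_xy:.3f}")
print(f"partial correlation given Z:       {partial_xy_given_z:.3f}")
```

In this toy model X and Y are clearly correlated, while the correlation of the
residuals is close to zero, in line with Z screening X and Y from each other.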


In practice, dependences may also arise for a reason different from the ones men- 
tioned in the common cause principle, for instance: (1) The random variables we 
observe are conditioned on others (often implicitly by a selection bias). We shall 
return to this issue; see Remark 6.29. (2) The random variables only appear to 
be dependent. For example, they may be the result of a search procedure over a 
large number of pairs of random variables that was run without a multiple testing 
correction. In this case, inferring a dependence between the variables does not sat- 
isfy the desired type I error control; see Appendix A.2. (3) Similarly, both random 
variables may inherit a time dependence and follow a simple physical law, such 
as exponential growth. The variables then look as if they depend on each other, 
but because the i.i.d. assumption is violated, there is no justification of applying 
a standard independence test. In particular, arguments (2) and (3) should be kept 
in mind when reporting “spurious correlations” between random variables, as it is 
done on many popular websites. 
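

Point (3) can be illustrated with two series that share nothing but a dependence on
time. In the sketch below, both series grow exponentially with independent
multiplicative noise; the growth rates, the noise level, and the use of log-differences
to remove the trend are assumptions chosen for this example and are only one of many
possible ways to account for the time dependence.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def log_differences(series):
    """Differences of consecutive log values; removes a deterministic exponential trend."""
    return [math.log(series[t + 1]) - math.log(series[t]) for t in range(len(series) - 1)]


rng = random.Random(0)
T = 400
# Two causally unrelated series that both follow exponential growth in time.
series_a = [math.exp(0.010 * t + rng.gauss(0.0, 0.1)) for t in range(T)]
series_b = [math.exp(0.015 * t + rng.gauss(0.0, 0.1)) for t in range(T)]

corr_raw = pearson(series_a, series_b)
corr_detrended = pearson(log_differences(series_a), log_differences(series_b))
print(f"correlation of the raw series:       {corr_raw:.3f}")
print(f"correlation after removing trends:   {corr_detrended:.3f}")
```

The raw series look strongly correlated, yet the observations within each series are
not i.i.d.; once the common time trend is removed, little correlation remains.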


1.4 Two Examples 


1.4.1 Pattern Recognition 


As the first example, we consider optical character recognition, a well-studied 
problem in machine learning. This is not a run-of-the-mill example of a causal 
structure, but it may be instructive for readers familiar with machine learning. We 
describe two causal models giving rise to a dependence between two random variables,
which we will assume to be handwritten digits X and class labels Y. The two
models will lead to the same statistical structure, using distinct underlying causal
structures.

Model (i) assumes we generate each pair of observations by providing a sequence 
of class labels y to a human writer, with the instruction to always produce a corre- 
sponding handwritten digit image x. We assume that the writer tries to do a good 
job, but there may be noise in perceiving the class label and executing the motor 
program to draw the image. We can model this process by writing the image X as a 
function (or mechanism) f of the class label Y (modeled as a random variable) and 
some independent noise $N_X$ (see Figure 1.3, left). We can then compute $P_{X,Y}$ from
$P_Y$, $P_{N_X}$, and f. This is referred to as the observational distribution, where the
word “observational” refers to the fact that we are passively observing the system 
without intervening. X and Y will be dependent random variables, and we will be 
able to learn the mapping from x to y from observations and predict the correct 
label y from an image x better than chance. 

There are two possible interventions in this causal structure, which lead to
intervention distributions.⁴ If we intervene on the resulting image X (by manipulating
it, or exchanging it for another image after it has been produced), then this has no
effect on the class labels that were provided to the writer and recorded in the data
set. Formally, changing X has no effect on Y since $Y := N_Y$. Intervening on Y, on
the other hand, amounts to changing the class labels provided to the writer. This
will obviously have a strong effect on the produced images. Formally, changing Y
has an effect on X since $X := f(Y, N_X)$. This directionality is visible in the arrow
in the figure, and we think of this arrow as representing direct causation.

In alternative model (ii), we assume that we do not provide class labels to the 
writer. Rather, the writer is asked to decide himself or herself which digits to write, 
and to record the class labels alongside. In this case, both the image X and the 
recorded class label Y are functions of the writer’s intention (call it Z and think 
of it as a random variable). For generality, we assume that not only the process 
generating the image is noisy but also the one recording the class label, again with 
independent noise terms (see Figure 1.3, right). Note that if the functions and noise 
terms are chosen suitably, we can ensure that this model entails an observational 
distribution $P_{X,Y}$ that is identical to the one entailed by model (i).⁵


⁴ We shall see in Section 6.3 that a more general way to think of interventions is that they
change functions and random variables.

⁵ Indeed, Proposition 4.1 implies that any joint distribution $P_{X,Y}$ can be entailed by
both models.


Figure 1.3: Two structural causal models of handwritten digit data sets. In the left
model (i), a human is provided with class labels Y and produces images X. In the right
model (ii), the human decides which class to write (Z) and produces both images and class
labels. For suitable functions f, g, h and noise variables $N_X$, $M_X$, $M_Y$, Z, the two
models produce the same observable distribution $P_{X,Y}$, yet they are interventionally
different; see Section 1.4.1.


Let us now discuss possible interventions in model (ii). If we intervene on the 
image X, then things are as we just discussed and the class label Y is not affected. 
However, if we intervene on the class label Y (i.e., we change what the writer has 
recorded as the class label), then unlike before this will not affect the image. 

In summary, without restricting the class of involved functions and distributions, 
the causal models described in (i) and (ii) induce the same observational distribu- 
tion over X and Y, but different intervention distributions. This difference is not 
visible in a purely probabilistic description (where everything derives from $P_{X,Y}$).
However, we were able to discuss it by incorporating structural knowledge about
how $P_{X,Y}$ comes about, in particular graph structure, functions, and noise terms.
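
The contrast between models (i) and (ii) can be simulated with a heavily simplified
stand-in for the digit example. In the sketch below, the class label and the "image"
are both reduced to a single bit, the image mechanism flips the bit with probability
0.1, and in model (ii) the recorded label simply copies the intention Z (so the label
noise is taken to be trivial). These simplifications, the function names, and the
`do_y` argument used to mimic an intervention on Y are all assumptions made for
illustration.

```python
import random


def sample_model_i(n, rng, do_y=None):
    """Model (i): Y := N_Y, X := f(Y, N_X). If do_y is given, Y is set to that value instead."""
    samples = []
    for _ in range(n):
        y = rng.randint(0, 1) if do_y is None else do_y
        n_x = 1 if rng.random() < 0.1 else 0   # image noise N_X
        x = y ^ n_x                            # "image" copies the label unless noise flips it
        samples.append((x, y))
    return samples


def sample_model_ii(n, rng, do_y=None):
    """Model (ii): X and Y both depend on the intention Z. do_y overrides only the label Y."""
    samples = []
    for _ in range(n):
        z = rng.randint(0, 1)                  # writer's intention Z
        m_x = 1 if rng.random() < 0.1 else 0   # image noise M_X
        x = z ^ m_x
        y = z if do_y is None else do_y        # recorded label; noise-free in this sketch
        samples.append((x, y))
    return samples


def agreement(samples):
    """Fraction of samples in which image bit and label coincide."""
    return sum(1 for x, y in samples if x == y) / len(samples)


def frac_x1(samples):
    """Fraction of samples whose image bit equals 1."""
    return sum(1 for x, _ in samples if x == 1) / len(samples)


rng = random.Random(0)
n = 20000
obs_i, obs_ii = sample_model_i(n, rng), sample_model_ii(n, rng)
int_i, int_ii = sample_model_i(n, rng, do_y=1), sample_model_ii(n, rng, do_y=1)

print(f"observational P(X = Y):   (i) {agreement(obs_i):.3f}   (ii) {agreement(obs_ii):.3f}")
print(f"P(X = 1) after setting Y: (i) {frac_x1(int_i):.3f}   (ii) {frac_x1(int_ii):.3f}")
```

Observationally, the two samplers are indistinguishable up to sampling noise. After
setting Y to 1, the image follows the label in model (i) but remains unaffected in
model (ii), which is the interventional difference described above.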

Models (i) and (ii) are examples of structural causal models (SCMs), some- 
times referred to as structural equation models [e.g., Aldrich, 1989, Hoover, 
2008, Pearl, 2009, Pearl et al., 2016]. In an SCM, all dependences are generated by 
functions that compute variables from other variables. Crucially, these functions 
are to be read as assignments, that is, as functions as in computer science rather 
than as mathematical equations. We usually think of them as modeling physical 
mechanisms. An SCM entails a joint distribution over all observables. We have 
seen that the same distribution can be generated by different SCMs, and thus in- 
formation about the effect of interventions (and, as we shall see in Section 6.4, 
information about counterfactuals) may be lost when we make the transition from 
an SCM to the corresponding probability model. In this book, we take SCMs as 
our starting point and try to develop everything from there.

We conclude with two points connected to our example: 

First, Figure 1.3 nicely illustrates Reichenbach’s common cause principle. The 
dependence between X and Y admits several causal explanations, and X and Y 
become independent if we condition on Z in the right-hand figure: The image and 
the label share no information that is not contained in the intention. 

Second, it is sometimes said that causality can only be discussed when taking 
into account the notion of time. Indeed, time does play a role in the preceding 
example, for instance by ruling out that an intervention on X will affect the class 
label. However, this is perfectly fine, and indeed it is quite common that a sta- 
tistical data set is generated by a process taking place in time. For instance, in 
model (i), the underlying reason for the statistical dependence between X and Y 
is a dynamical process. The writer reads the label and plans a movement, entail- 
ing complicated processes in the brain, and finally executes the movement using 
muscles and a pen. This process is only partly understood, but it is a physical, 
dynamical process taking place in time whose end result leads to a non-trivial joint 
distribution of X and Y. When we perform statistical learning, we only care about 
the end result. Thus, not only causal structures, but also purely probabilistic struc- 
tures may arise through processes taking place in time — indeed, one could hold 
that this is ultimately the only way they can come about. However, in both cases, 
it is often instructive to disregard time. In statistics, time is often not necessary 
to discuss concepts such as statistical dependence. In causal models, time is often 
not necessary to discuss the effect of interventions. But both levels of description 
can be thought of as abstractions of an underlying more accurate physical model 
that describes reality more fully than either; see Table 1.1. Moreover, note that 
variables in a model may not necessarily refer to well-defined time instances. If, 
for instance, a psychologist investigates the statistical or causal relation between 
the motivation and the performance of students, neither variable can easily be
assigned to a specific time instance. Measurements that refer to well-defined time
instances are rather typical for “hard” sciences like physics and chemistry. 


1.4.2 Gene Perturbation 


We have seen in Section 1.4.1 that different causal structures lead to different in- 
tervention distributions. Sometimes, we are indeed interested in predicting the 
outcome of a random variable under such an intervention. Consider the following, 
in some ways oversimplified, example from genetics. Assume that we are given 
activity data from gene A and, correspondingly, measurements of a phenotype; see

Model                                     Predict in  Predict under      Answer           Obtain     Learn
                                          i.i.d.      changing distr.    counterfactual   physical   from
                                          setting     or intervention    questions        insight    data
Mechanistic/physical, e.g., Sec. 2.3      yes         yes                yes              yes        ?
Structural causal model, e.g., Sec. 6.2   yes         yes                yes              ?          ?
Causal graphical model, e.g., Sec. 6.5.2  yes         yes                no               ?          ?
Statistical model, e.g., Sec. 1.2         yes         no                 no               no         yes


Table 1.1: A simple taxonomy of models. The most detailed model (top) is a mechanis- 
tic or physical one, usually involving sets of differential equations. At the other end of the 
spectrum (bottom), we have a purely statistical model; this model can be learned from data, 
but it often provides little insight beyond modeling associations between epiphenomena. 
Causal models can be seen as descriptions that lie in between, abstracting away from phys- 
ical realism while retaining the power to answer certain interventional or counterfactual 
questions. See Mooij et al. [2013] for a discussion of the link between physical models 
and structural causal models, and Section 6.3 for a discussion of interventions. 


Figure 1.4 (top left) for a toy data set. Clearly, both variables are strongly corre- 
lated. This correlation can be exploited for classical prediction: If we observe that 
the activity of gene A lies around 6, we expect the phenotype to lie between 12 and 
16 with high probability. Similarly, for a gene B (bottom left). On the other hand, 
we may also be interested in predicting the phenotype after deleting gene A, that 
is, after setting its activity to 0.* Without any knowledge of the causal structure,
however, it is impossible to provide a non-trivial answer. If gene A has a causal
influence on the phenotype, we expect to see a drastic change after the intervention
(see top right). In fact, we may still be able to use the same linear model that we
have learned from the observational data. If, alternatively, there is a common cause,
possibly a third gene C, influencing both the activity of gene B and the phenotype,
the intervention on gene B will have no effect on the phenotype (see bottom right).


* Let us for simplicity assume that we have access to the true activity of the gene without
measurement noise.

As in the pattern recognition example, the models are again chosen such that
the joint distribution over gene A and the phenotype equals the joint distribution 
over gene B and the phenotype. Therefore, there is no way of telling between the 
top and bottom situation from just observational data, even if sample size goes to 
infinity. Summarizing, if we are not willing to employ concepts from causality, 
we have to answer “I do not know” to the question of predicting a phenotype after 
deletion of a gene.

Figure 1.4: The activity of two genes (top: gene A; bottom: gene B) is strongly correlated
with the phenotype (black dots). However, the best prediction for the phenotype when
deleting the gene, that is, setting its activity to 0 (left), depends on the causal structure
(right). If a common cause is responsible for the correlation between gene and pheno- 
type, we expect the phenotype to behave under the intervention as it usually does (bottom 
right), whereas the intervention clearly changes the value of the phenotype if it is causally 
influenced by the gene (top right). The idea of this figure is based on Peters et al. [2016]. 
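
The gene example can be mimicked in a few lines of simulation. The following is a minimal sketch under assumptions of our own: the two toy models, the function names `sample` and `summarize`, and all coefficients are hypothetical and are not taken from Figure 1.4. Both models are built to entail the same observational joint distribution of gene activity and phenotype, yet they respond differently to setting the gene activity to 0.

```python
import numpy as np


def sample(model, n, rng, do_gene=None):
    """Draw n (gene activity, phenotype) pairs from one of two toy models.

    All coefficients are illustrative assumptions, not values from the text.
    If do_gene is given, the gene activity is set to that value by intervention.
    """
    noise = rng.normal(0.0, 1.0, n)
    if model == "gene_causes_phenotype":       # gene -> phenotype
        gene = rng.normal(3.0, 1.0, n)
        if do_gene is not None:
            gene = np.full(n, float(do_gene))
        phenotype = 2.0 * gene + noise
    elif model == "common_cause":              # gene <- C -> phenotype
        c = rng.normal(3.0, 1.0, n)            # hidden common cause, e.g., a third gene
        gene = c.copy()
        if do_gene is not None:
            gene = np.full(n, float(do_gene))
        phenotype = 2.0 * c + noise            # the phenotype ignores the gene itself
    else:
        raise ValueError(f"unknown model: {model}")
    return gene, phenotype


def summarize(model, n=200_000, seed=0):
    """Return (observational regression slope, mean phenotype after deletion)."""
    rng = np.random.default_rng(seed)
    gene, phenotype = sample(model, n, rng)
    slope = np.polyfit(gene, phenotype, 1)[0]   # slope of phenotype on gene
    _, phenotype_do = sample(model, n, rng, do_gene=0.0)
    return slope, phenotype_do.mean()


for model in ("gene_causes_phenotype", "common_cause"):
    slope, mean_after_deletion = summarize(model)
    print(f"{model}: slope={slope:.2f}, "
          f"mean phenotype after deletion={mean_after_deletion:.2f}")
```

In this sketch the fitted slopes agree across the two models, so a purely statistical analysis cannot tell them apart, while the mean phenotype after deletion drops only in the model where the gene is a cause.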


2 


Assumptions for Causal Inference 


Now that we have encountered the basic components of SCMs, it is a good time to 
pause and consider some of the assumptions we have seen, as well as what these 
assumptions imply for the purpose of causal reasoning and learning. 

A crucial notion in our discussion will be a form of independence, and we can 
informally introduce it using an optical illusion known as the Beuchet chair. When 
we see an object such as the one on the left of Figure 2.1, our brain makes the 
assumption that the object and the mechanism by which the information contained 
in its light reaches our brain are independent. We can violate this assumption by 
looking at the object from a very specific viewpoint. If we do that, perception goes 
wrong: We perceive the three-dimensional structure of a chair, which in reality is 
not there. Most of the time, however, the independence assumption does hold. If 
we look at an object, our brain assumes that the object is independent from our 
vantage point and the illumination. So there should be no unlikely coincidences, 
no separate 3D structures lining up in two dimensions, or shadow boundaries coin- 
ciding with texture boundaries. This is called the generic viewpoint assumption in 
vision [Freeman, 1994]. 

The independence assumption is more general than this, though. We will see in 
Section 2.1 below that the causal generative process is composed of autonomous 
modules that do not inform or influence each other. As we shall describe below, 
this means that while one module’s output may influence another module’s input, 
the modules themselves are independent of each other. 

In the preceding example, while the overall percept is a function of object, light- 
ing, and viewpoint, the object and the lighting are not affected by us moving about 
— in other words, some components of the overall causal generative model remain 
invariant, and we can infer three-dimensional information from this invariance.

Figure 2.1: The left panel shows a generic view of the (separate) parts comprising a
Beuchet chair. The right panel shows the illusory percept of a chair if the parts are viewed 
from a single, very special vantage point. From this accidental viewpoint, we perceive a 
chair. (Image courtesy of Markus Elsholz.) 


This is the basic idea of structure from motion [Ullman, 1979], which plays a central
role in both biological vision and computer vision.


2.1 The Principle of Independent Mechanisms 


We now describe a simple cause-effect problem and point out several observations. 
Subsequently, we shall try to provide a unified view of how these observations relate
to each other, arguing that they derive from a common independence principle. 

Suppose we have estimated the joint density p(a,t) of the altitude A and the 
average annual temperature T of a sample of cities in some country (see Figure 4.6 
on page 65). Consider the following ways of expressing p(a,t): 


    p(a,t) = p(a|t) p(t)
           = p(t|a) p(a)                                        (2.1)

The first decomposition describes T and the conditional A|T. It corresponds to a
factorization of p(a,t) according to the graph T → A.¹ The second decomposition
corresponds to a factorization according to A → T (cf. Definition 6.21). Can we
decide which of the two structures is the causal one (i.e., in which case would we
be able to think of the arrow as causal)?

A first idea (see Figure 2.2, left) is to consider the effect of interventions. Imagine
we could change the altitude A of a city by some hypothetical mechanism that
raises the grounds on which the city is built. Suppose that we find that the average
temperature decreases. Let us next imagine that we devise another intervention
experiment. This time, we do not change the altitude, but instead we build a massive
heating system around the city that raises the average temperature by a few degrees.
Suppose we find that the altitude of the city is unaffected. Intervening on A
has changed T, but intervening on T has not changed A. We would thus reasonably
prefer A → T as a description of the causal structure.

Why do we find this description of the effect of interventions plausible, even
though the hypothetical intervention is hard or impossible to carry out in practice?

If we change the altitude A, then we assume that the physical mechanism p(t|a)
responsible for producing an average temperature (e.g., the chemical composition
of the atmosphere, the physics of how pressure decreases with altitude, the
meteorological mechanisms of winds) is still in place and leads to a changed T. This
would hold true independent of the distribution from which we have sampled the
cities, and thus independent of p(a). Austrians may have founded their cities in
locations subtly different from those of the Swiss, but the mechanism p(t|a) would
apply in both cases.²

If, on the other hand, we change T, then we have a hard time thinking of p(a|t)
as a mechanism that is still in place; we probably do not believe that such a
mechanism exists in the first place. Given a set of different city distributions p(a,t),
while we could write them all as p(a|t) p(t), we would find that it is impossible to
explain them all using an invariant p(a|t).

Our intuition can be rephrased and postulated in two ways: If A → T is the correct
causal structure, then


(i) it is in principle possible to perform a localized intervention on A, in other
words, to change p(a) without changing p(t|a), and


(ii) p(a) and p(t|a) are autonomous, modular, or invariant mechanisms or
objects in the world.


¹ Note that the conditional density p(a|t) allows us to compute p(a,t) (and thus also p(a)) from
p(t), which may serve to motivate the direction of the arrow in T → A for the time being. This will
be made precise in Definition 6.21.

² This is an idealized setting; no doubt counterexamples to these general remarks can be
constructed.

Interestingly, while we started off with a hypothetical intervention experiment to
arrive at the causal structure, our reasoning ends up suggesting that actual interven- 
tions may not be the only way to arrive at causal structures. We may also be able 
to identify the causal structure by checking, for data sources p(a,t), which of the 
two decompositions (2.1) leads to autonomous or invariant terms. Sticking with 
the preceding example, let us denote the joint distributions of altitude and temperature
in Austria and Switzerland by p^ö(a,t) and p^s(a,t), respectively. These may
be distinct since Austrians and Swiss founded their cities in different places (i.e.,
p^ö(a) and p^s(a) are distinct). The causal factorizations, however, may still use the
same conditional, i.e., p^ö(a,t) = p(t|a) p^ö(a) and p^s(a,t) = p(t|a) p^s(a).
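
The two-country argument can be illustrated numerically. The sketch below rests on assumptions of our own: a linear Gaussian mechanism with made-up coefficients, hypothetical helper names `sample_cities` and `slopes`, and regression slopes used as a crude stand-in for comparing entire conditional distributions. The two countries differ only in p(a); the mechanism p(t|a) is shared.

```python
import numpy as np


def sample_cities(n, altitude_mean, altitude_sd, rng):
    """Sample (altitude, temperature) pairs for one country.

    p(a) differs between countries; the mechanism p(t|a) is shared.
    All numbers are illustrative assumptions (a toy model, so negative
    altitudes are not excluded).
    """
    a = rng.normal(altitude_mean, altitude_sd, n)        # altitude in metres
    t = 15.0 - 0.0065 * a + rng.normal(0.0, 1.0, n)     # shared mechanism p(t|a)
    return a, t


def slopes(a, t):
    """Least-squares slopes of T on A (causal) and of A on T (anticausal)."""
    forward = np.polyfit(a, t, 1)[0]
    backward = np.polyfit(t, a, 1)[0]
    return forward, backward


rng = np.random.default_rng(0)
country_1 = sample_cities(50_000, 500.0, 200.0, rng)
country_2 = sample_cities(50_000, 1200.0, 500.0, rng)

fwd_1, bwd_1 = slopes(*country_1)
fwd_2, bwd_2 = slopes(*country_2)
print(f"T on A: {fwd_1:.4f} vs {fwd_2:.4f}")   # nearly identical
print(f"A on T: {bwd_1:.1f} vs {bwd_2:.1f}")   # clearly different
```

In this toy setting the regression of T on A is stable across the two countries, whereas the regression of A on T changes with p(a), which is the kind of asymmetry the text suggests looking for.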

We next describe an idea (see Figure 2.2, middle), closely related to the previous 
example, but different in that it also applies for individual distributions. In the 
causal factorization p(a,t) = p(t|a) p(a), we would expect that the conditional
density p(t|a) (viewed as a function of t and a) provides no information about the
marginal density function p(a). This holds true if p(t|a) is a model of a physical
mechanism that does not care about what distribution p(a) we feed into it. In other 
words, the mechanism is not influenced by the ensemble of cities to which we 
apply it. 

If, on the other hand, we write p(a,t) = p(a|t) p(t), then the preceding indepen- 
dence of cause and mechanism does not apply. Instead, we will notice that to 
connect the observed p(t) and p(a,t), the mechanism p(a|t) would need to take a 
rather peculiar shape constrained by the equation p(a,t) = p(a|t)p(t). This could 
be empirically checked, given an ensemble of cities and temperatures.³

We have already seen several ideas connected to independence, autonomy, and
invariance, all of which can inform causal inference. We now turn to a final one
(see Figure 2.2, right), related to the independence of noise terms and thus best
explained when rewriting (2.1) as a distribution entailed by an SCM with graph
A → T, realizing the effect T as a noisy function of the cause A,


    A := N_A,
    T := f_T(A, N_T),


where N_T and N_A are statistically independent noises, N_T ⊥⊥ N_A. Making suitable
restrictions on the functional form of f_T (see Sections 4.1.3–4.1.6 and 7.1.2) allows
us to identify which of two causal structures (A → T or T → A) has entailed
the observed p(a,t) (without such restrictions though, we can always realize both
decompositions (2.1)). Furthermore, in the multivariate setting and under suitable
conditions, the assumption of jointly independent noises allows the identification
of causal structures by conditional independence testing (see Section 7.1.1).


³ We shall formalize this idea in Section 4.1.7.

Figure 2.2: The principle of independent mechanisms and its implications for causal
inference (Principle 2.1).

We like to view all these observations as closely connected instantiations of a 
general principle of (physically) independent mechanisms. 


Principle 2.1 (Independent mechanisms) The causal generative process of a 
system’s variables is composed of autonomous modules that do not inform or in- 
fluence each other. 

In the probabilistic case, this means that the conditional distribution of each 
variable given its causes (i.e., its mechanism) does not inform or influence the 
other conditional distributions. In case we have only two variables, this reduces to 
an independence between the cause distribution and the mechanism producing the 
effect distribution. 


The principle is plausible if we conceive our system as being composed of mod- 
ules comprising (sets of) variables such that the modules represent physically in- 
dependent mechanisms of the world. The special case of two variables has been 
referred to as independence of cause and mechanism (ICM) [Daniušis et al., 2010, 
Shajarisales et al., 2015]. It is obtained by thinking of the input as the result of a 
preparation that is done by a mechanism that is independent of the mechanism that 
turns the input into the output. 

Before we discuss the principle in depth, we should state that not all systems will 
satisfy it. For instance, if the mechanisms that an overall system is composed of 
have been tuned to each other by design or evolution, this independence may be 
violated.

We will presently argue that the principle is sufficiently broad to cover the main
aspects of causal reasoning and causal learning (see Figure 2.2). Let us address 
three aspects, corresponding, from left to right, to the three branches of the tree in 
Figure 2.2. 


1. One way to think of these modules is as physical machines that incorporate 
an input-output behavior. This assumption implies that we can change one 
mechanism without affecting the others — or, in causal terminology, we 
can intervene on one mechanism without affecting the others. Changing a 
mechanism will change its input-output behavior, and thus the inputs other 
mechanisms downstream might receive, but we are assuming that the phys- 
ical mechanisms themselves are unaffected by this change. An assumption 
such as this one is often implicit to justify the possibility of interventions in 
the first place, but one can also view it as a more general basis for causal rea- 
soning and causal learning. If a system allows such localized interventions, 
there is no physical pathway that would connect the mechanisms to each 
other in a directed way by “meta-mechanisms.” The latter makes it plausible
that we can also expect a tendency for mechanisms to remain invariant
with respect to changes within the system under consideration and possibly 
also to some changes stemming from outside the system (see Section 7.1.6). 
This kind of autonomy of mechanisms can be expected to help with trans- 
fer of knowledge learned in one domain to a related one where some of the 
modules coincide with the source domain (see Sections 5.2 and 8.3). 


2. While the discussion of the first aspect focused on the physical aspect of 
independence and its ramifications, there is also an information theoretic as- 
pect that is implied by the above. A time evolution involving several coupled 
objects and mechanisms can generate statistical dependence. This is related 
to our discussion from page 10, where we considered the dependence be- 
tween the class label and the image of a handwritten digit. Similarly, mech- 
anisms that are physically coupled will tend to generate information that can 
be quantified in terms of statistical or algorithmic information measures (see 
Sections 4.1.9 and 6.10 below). 

Here, it is important to distinguish between two levels of information: ob- 
viously, an effect contains information about its cause, but — according to 
the independence principle — the mechanism that generates the effect from 
its cause contains no information about the mechanism generating the cause. 
For a causal structure with more than two nodes, the independence principle
states that the mechanisms generating each node from its direct causes
contain no information about each other.⁴


3. Finally, we should discuss how the assumption of independent noise terms, 
commonly made in structural equation modeling, is connected to the principle
of independent mechanisms. This connection is less obvious. To this end,
consider a variable E := f(C,N) where the noise N is discrete. For each
value s taken by N, the assignment E := f(C,N) reduces to a deterministic
mechanism E := f^s(C) that turns an input C into an output E. Effectively,
this means that the noise randomly chooses between a number of mechanisms
f^s (where the number equals the cardinality of the range of the noise
variable N). Now suppose the noise variables for two mechanisms at the
vertices X_j and X_k were statistically dependent. Such a dependence could
ensure, for instance, that whenever one mechanism f_j^s is active at node j,
we know which mechanism f_k^{s'} is active at node k. This would violate our
principle of independent mechanisms. 

The preceding paragraph uses the somewhat extreme view of noise vari- 
ables as selectors between mechanisms (see also Section 3.4). In practice, 
the role of the noise might be less pronounced. For instance, if the noise 
is additive (i.e., E := f(C) +N), then its influence on the mechanism is re- 
stricted. In this case, it can only shift the output of the mechanism up or 
down, so it selects between a set of mechanisms that are very similar to each 
other. This is consistent with a view of the noise variables as variables out- 
side the system that we are trying to describe, representing the fact that a 
system can never be totally isolated from its environment. In such a view, 
one would think that a weak dependence of noises may be possible without 
invalidating the principle of independent mechanisms. 
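
The view of the noise as a selector between mechanisms, described in the third aspect above, can be made concrete with a small sketch. Everything in it is a hypothetical illustration of our own: the three mechanisms in `MECHANISMS`, the function `g` used for the additive case, and the helper names.

```python
import random

# Hypothetical finite family of deterministic mechanisms f^s, indexed by the noise value s.
MECHANISMS = {
    0: lambda c: c,
    1: lambda c: c + 1,
    2: lambda c: 2 * c,
}


def f(c, n):
    """Structural assignment E := f(C, N): the noise value n selects the mechanism f^n."""
    return MECHANISMS[n](c)


def sample_effect(c, rng):
    """Draw the noise N independently of the input c and return (N, E)."""
    n = rng.choice(list(MECHANISMS))
    return n, f(c, n)


def g(c):
    """An arbitrary deterministic function for the additive-noise case."""
    return c * c


def f_additive(c, n):
    """Additive noise E := g(C) + N: every noise value only shifts the same curve g."""
    return g(c) + n


rng = random.Random(0)
print([sample_effect(3, rng) for _ in range(5)])
print([f_additive(3, n) for n in (-1, 0, 1)])   # shifted copies of g(3)
```

With general noise, each value of N can pick out a qualitatively different input-output map; with additive noise, the selected maps differ only by a vertical shift, which matches the remark that the role of the noise is then less pronounced.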


All of the above-mentioned aspects of Principle 2.1 may help for the problem of 
causal learning, in other words, they may provide information about causal struc- 
tures. It is conceivable, however, that this information may in cases be conflicting, 
depending on which assumptions hold true in any given situation. 


⁴ There is an intuitive relation between this aspect of independence and the one described under 1.:
whenever the mechanisms change independently, the change of one mechanism does not provide 
information on how the others have changed. Despite this overlap, the second independence contains 
an aspect that is not strictly contained in the first one because it is also applicable to a scenario in 
which none of the mechanisms has changed; for example, it refers also to homogeneous data sets. 

⁵ Although we have so far focused on the two-variable case, we phrase this argument such that it
also applies for causal structures with more than two variables.

FIG. 5.


Diagram illustrating the casual relations between litter mates (O, O′) and between
each of them and their parents. H, H′, H″, H‴ represent the genetic constitutions of
the four individuals, G, G′, G″, and G‴ that of four germ cells. E represents such
environmental factors as are common to litter mates. D represents other factors, 
largely ontogenetic irregularity. The small letters stand for the various path 
coefficients. 


Figure 2.3: Early path diagram; dam and sire are the female and male parents of a guinea 
pig, respectively. The path coefficients capture the importance of a given path, defined as 
the ratio of the variability of the effect to be found when all causes are constant except 
the one in question, the variability of which is kept unchanged, to the total variability. 
(Reproduced from Wright [1920].) 


2.2 Historical Notes 


The idea of autonomy and invariance is deeply engrained in the concept of struc- 
tural equation models (SEMs) or SCMs. We prefer the latter term, since the term 
SEM has been used in a number of contexts where the structural assignments are 
used as algebraic equations rather than assignments. The literature is wide ranging, 
with overviews provided by Aldrich [1989], Hoover [2008], and Pearl [2009]. 

An intellectual antecedent to SEMs is the concept of a path model pioneered 
by Wright [1918, 1920, 1921] (see Figure 2.3). Although Wright was a biolo- 
gist, SEMs are nowadays most strongly associated with econometrics. Following 
Hoover [2008], pioneering work on structural econometric models was done in the
1930s by Jan Tinbergen, and the conceptual foundations of probabilistic econometrics
were laid in Trygve Haavelmo’s work [Haavelmo, 1944]. Early economists
were trying to conceptualize the fact that unlike correlation, regression has a natural
direction. The regression of Y on X leads to a solution that usually is not the
inverse of the regression of X on Y.⁶ But how would the data then tell us in which
direction we should perform the regression? This is a problem of observational 
equivalence, and it is closely related to a problem econometricians call identifica- 
tion. 
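
The asymmetry of regression can be seen from the standard least-squares formulas for simple linear regression with an intercept. The notation below (slopes written as b with the regression direction as subscript, and ρ for the correlation coefficient) is our own choice for this sketch:

```latex
\hat{b}_{Y \mid X} = \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)}, \qquad
\hat{b}_{X \mid Y} = \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(Y)}, \qquad
\hat{b}_{Y \mid X}\,\hat{b}_{X \mid Y}
  = \frac{\operatorname{Cov}(X,Y)^{2}}{\operatorname{Var}(X)\,\operatorname{Var}(Y)}
  = \rho_{XY}^{2} \le 1 .
```

The two fitted lines would be inverses of each other only if the product of the slopes were 1, that is, only if |ρ| = 1 and the data lie exactly on a line.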

A number of early works saw a connection between what made a set of equations 
or relations structural [Frisch and Waugh, 1933], and properties of invariance and 
autonomy — according to Aldrich [1989], indeed the central notion in the pioneer- 
ing work of Frisch et al. [1948]. Here, a structural relation was aiming for more 
than merely modeling an observed distribution of data — it was trying to capture 
an underlying structure connecting the variables of the model. 

At the time, the Cowles Commission was a major economic research institute, 
instrumental in creating the field of econometrics. Its work related causality to the 
invariance properties of the structural econometric model [Hoover, 2008]. Pearl 
[2009] credits Marschak’s opening chapter of a 1950 Cowles monograph with the 
idea that structural equations remain invariant to certain changes in the system 
[Marschak, 1950]. A crucial distinction emphasized by the Cowles work was the 
one between endogenous and exogenous variables. Endogenous variables are
those that the modeler tries to understand, while exogenous ones are determined
by factors outside the model, and are taken as given. Koopmans [1950] assayed
two principles for determining what should be treated as exogenous. The departmental
principle considers variables outside of the scope of the discipline as
exogenous (e.g., weather is exogenous to economics). The (preferred) causal
principle calls those variables exogenous that influence the remaining (endogenous)
variables, but are (almost) not influenced thereby.

Haavelmo [1943] interpreted structural equations as statements about hypothet- 
ical controlled experiments. He considered cyclic stochastic equation models and 
discussed the role of invariance as well as policy interventions. Pearl [2015] gives 
an appraisal of Haavelmo’s role in the study of policy intervention questions and 
the development of the field of causal inference. In an account of causality in
economics and econometrics, Hoover [2008] discusses a system of the form

    X^i := N_X^i,
    Y^i := θ X^i + N_Y^i,

where the errors N_X^i, N_Y^i are i.i.d., and θ is a parameter. He attributes to Simon
[1953] the view (which does not require any temporal order) that X^i may be referred
to as causing Y^i since one knows all about X^i without knowing about Y^i, but
not vice versa. The equations also allow us to predict the effect of interventions.
Hoover goes on to argue that one can rewrite the system reversing the roles of X^i
and Y^i while retaining the property that the error terms are uncorrelated.⁷ He thus
points out that we cannot infer the correct causal direction on the basis of a single
set of data (“observational equivalence”). Experiments, either controlled or natural,
could help us decide. If, for example, an experiment can change the conditional
distribution of Y^i given X^i, without altering the marginal distribution of X^i, then it
must be that X^i causes Y^i. Hoover refers to this as Simon’s invariance criterion:
the true causal order is the one that is invariant under the right sort of intervention.⁸
Hurwicz [1962] argues that an equation system becomes structural by virtue of invariance
to a domain of modifications. Such a system then bears resemblance to a
natural law. Hurwicz recognized that one can use such modifications to determine
structure, and that while structure is necessary for causality, it is not for prediction.

⁶ As an aside, while most of the early works were using linear equations only, there have also been
attempts to generalize to nonlinear SEMs [Hoover, 2008].
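
Hoover's point about rewriting the system can be checked numerically. The sketch below is a simulation under assumptions of our own: Gaussian errors, an arbitrary parameter value `THETA`, and a hypothetical helper `simulate`. It fits the reversed equation by least squares, confirms that its error term is uncorrelated with Y, and then shows that only interventions reveal which way the arrow points.

```python
import numpy as np

THETA = 0.8  # illustrative parameter value (an assumption)


def simulate(n, rng, do_x=None, do_y=None):
    """Sample from X := N_X, Y := THETA * X + N_Y, optionally under an intervention."""
    x = rng.normal(size=n)
    if do_x is not None:
        x = np.full(n, float(do_x))          # intervention on X replaces its assignment
    y = THETA * x + rng.normal(size=n)
    if do_y is not None:
        y = np.full(n, float(do_y))          # intervention on Y leaves X untouched
    return x, y


rng = np.random.default_rng(1)
x, y = simulate(100_000, rng)

# Reversed form X = beta * Y + M, with beta the least-squares coefficient.
cov = np.cov(x, y)
beta = cov[0, 1] / cov[1, 1]
m = x - beta * y
corr_m_y = np.corrcoef(m, y)[0, 1]           # zero up to floating-point error

# Interventions in the data-generating system break the symmetry.
x_do_y, _ = simulate(100_000, rng, do_y=2.0)
_, y_do_x = simulate(100_000, rng, do_x=2.0)

print(f"beta = {beta:.3f}, corr(M, Y) = {corr_m_y:.1e}")
print(f"mean of X under do(Y := 2): {x_do_y.mean():.3f}")   # X does not respond
print(f"mean of Y under do(X := 2): {y_do_x.mean():.3f}")   # Y responds
```

Since the errors are Gaussian in this sketch, an uncorrelated error term is also independent, so the two forms are indistinguishable from observational data; the reversed equation would nevertheless wrongly predict that X shifts by roughly beta times 2 when Y is set to 2.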

Aldrich [1989] provides an account of the role of autonomy in structural equation 
modeling. He argues that autonomous relations are likely to be more stable than 
others. He equates Haavelmo’s autonomous variables with what subsequently became
known as exogenous variables. Autonomous variables are parameters fixed
by external forces, or treated as stochastically independent.⁹ Following Aldrich
[1989, page 30], “the use of the qualifier autonomous and the phrase forces exter- 
nal to the sector under consideration suggest that ... the parameters of that model 
would be invariant to changes in the sectoral parameters.” He also relates invari- 
ance to a notion termed super-exogeneity [Engle et al., 1983]. 

While the early proponents of structural equation modeling already had some
profound insights into their causal underpinnings, the developments in computer science
initially happened separately. Pearl [2009, p. 104] relates how he and his
coworkers started connecting Bayesian networks and structural equation modeling:
“It suddenly hit me that the century-old tension between economists and statisticians
stems from simple semantic confusion: statisticians read structural equations
as statements about E[Y|x] while economists read them as E[Y|do(x)]. This
would explain why statisticians claim that structural equations have no meaning
and economists retort that statistics has no substance.” Pearl [2009, p. 22] formulates
the independence principle as follows: “that each parent-child relationship in
the network represents a stable and autonomous physical mechanism – in other
words, that it is conceivable to change one such relationship without changing the
others.”

⁷ We shall revisit this topic in more detail in Section 4.1.3.

⁸ We would argue that this may not hold true if interventions are coupled to each other, for example,
to keep the anticausal conditional (which describes the cause, given its effect) invariant. This
could be seen as a violation of Principle 2.1 on the level of interventions. We return to this point in
Section 2.3.4.

⁹ This is akin to the independence of noise terms we use in SCMs.

It is noteworthy, and indeed a motivation for writing the present book, that among 
the different implications of Principle 2.1, shown in Figure 2.2, most of the work 
using causal Bayesian networks only exploits the independence of noise terms.¹⁰
It leads to a rich structure of conditional independences [Pearl, 2009, Spirtes et al., 
2000, Dawid, 1979, Spohn, 1980], ultimately deriving from Reichenbach’s Prin- 
ciple 1.1. The other aspects of independence received significantly less attention 
[Hausman and Woodward, 1999, Lemeire and Dirkx, 2006], but there is a recent 
thread of work aiming at formalizing and using them. A major motivation for this 
has been the cause-effect problem where conditional independence is useless since 
we have only two variables (see Sections 4.1.2 and 6.10). Janzing and Schölkopf
[2010] formalize independence of mechanism in terms of algorithmic information 
theory (Section 4.1.9). They view the functions in an SCM as representing in- 
dependent causal mechanisms that persist after manipulating the distribution of 
inputs or other mechanisms. More specifically, in the context of causal Bayesian 
networks, they postulate that the conditional distributions of all nodes given their 
parents are algorithmically independent. In particular, for the causal Bayesian network
X → Y, P_X and P_{Y|X} contain no algorithmic information about each other,
meaning that knowledge of one does not admit a shorter description of the other.
The idea that unrelated mechanisms are algorithmically independent follows from 
the generalization of SCMs from random variables to individual objects where sta- 
tistical dependences are replaced with algorithmic dependences. 

Schölkopf et al. [2012, e.g., Section 2.1.1] discuss the question of robustness
with respect to changes in the distribution of the cause (in the two-variable setting),
and connect it to problems of machine learning; see also Chapter 5. Within
an SCM, they analyze invariance of either the function or of the noises, for different
learning scenarios (e.g., transfer learning, concept drift). They employ a notion
of independence of mechanism and input that subsumes both independence under
changes and information-theoretic independence (we called this the “overlap”
between the first and second independence in Figure 2.2 in the discussion of the
boxes): “P_{E|C} contains no information about P_C and vice versa; in particular, if P_{E|C}
changes at some point in time, there is no reason to believe that P_C changes at the
same time.”

¹⁰ Certain Bayesian structure learning methods [Heckerman et al., 1999] can be viewed as implementing
the independence principle by assigning independent priors to the conditional probabilities
of each variable given its causes.

Further links to transfer and related machine learning problems are discussed 
by Bareinboim and Pearl [2016], Rojas-Carulla et al. [2016], Zhang et al. [2013] 
and Zhang et al. [2015]. Peters et al. [2016] exploited invariance across envi- 
ronments for learning parts of the graph structure underlying a multivariate SCM 
(Section 7.1.6). 


2.3 Physical Structure Underlying Causal Models 


We conclude this chapter with some notes on connections to physics. Readers 
whose interests are limited to mathematical and statistical structures may prefer to 
skip this part. 


2.3.1 The Role of Time 


An aspect that is conspicuously missing in Section 2.1 is the role of time. Indeed, 
physics incorporates causality into its basic laws by excluding causation from fu- 
ture to past.!! This does not do away with all problems of causal inference, though. 
Already Simon [1953] recognized that while time ordering can provide a useful 
asymmetry, it is asymmetry that is important, not the temporal sequence. 
Microscopically, the time evolution of both classical systems and quantum me- 
chanical systems is widely believed to be invertible. This seems to contradict our 
intuition that the world evolves in a directed way — we believe we would be able 
to tell if time were to flow backward. The contradiction can be resolved in two 
ways. In one of them, suppose we have a complexity measure for states [Bennett,
1982, Zurek, 1989], and we start with a state whose complexity is very low. In that
case, time evolution (assuming it is sufficiently ergodic) will tend to increase complexity.
In the other, we assume that we are considering open systems. Even
if the time evolution for a closed system is invertible (e.g., in quantum mechanics,
a unitary time evolution), the time evolution of an open subsystem (which interacts
with its environment) in the generic case need not be invertible.

¹¹ More precisely, an event can only influence events lying in its light cone since no signal can
travel faster than the speed of light in a vacuum, according to the theory of relativity.


2.3.2 Physical Laws 


An often discussed causal question can be addressed with the following example. 
The ideal gas law stipulates that pressure p, volume V, amount of substance n, and 
absolute temperature T satisfy the equation 


$p \cdot V = n \cdot R \cdot T$, (2.2)


where R is the ideal gas constant. If we, for instance, change the volume V allo- 
cated to a given amount of gas, then pressure p and/or temperature T will change, 
and the specifics will depend on the exact setup of the intervention. If, on the other 
hand, we change T, then V and/or p will change. If we keep p constant, then we
can, at least approximately, construct a cycle involving T and V. So what causes 
what? It is sometimes argued that such laws show that it does not make sense to 
talk about causality unless the system is temporal. In the next paragraph, we ar- 
gue that this is misleading. The gas law (2.2) refers to an equilibrium state of an 
underlying dynamical system, and writing it as a simple equation does not provide 
enough information about what interventions are in principle possible and what is 
their effect. SCMs and their corresponding directed acyclic graphs do provide us 
with this information, but in the general case of non-equilibrium systems, it is a 
hard problem to determine whether and how a given dynamical system leads to an SCM.
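
The dependence on the setup of the intervention can be illustrated with a small numerical sketch. It compares two hypothetical ways of halving the volume of a gas: in contact with a heat bath (isothermal) and without heat exchange (reversible adiabatic). The initial state (one mole at 300 K in 0.025 m^3), the monatomic adiabatic exponent 5/3, and all variable names are assumptions chosen for illustration; they do not come from the text.

```r
# Ideal gas law p * V = n * R * T, solved for the pressure.
R_gas <- 8.314  # ideal gas constant in J / (mol K)
pressure <- function(V, temp, n = 1) n * R_gas * temp / V

# Assumed initial equilibrium state: 1 mol at 300 K in 0.025 m^3.
V1 <- 0.025
T1 <- 300
p1 <- pressure(V1, T1)

# Intervention A: halve V while a heat bath holds the temperature fixed.
V2 <- V1 / 2
T2_iso <- T1
p2_iso <- pressure(V2, T2_iso)

# Intervention B: halve V with no heat exchange (reversible adiabatic),
# assuming a monatomic gas, for which T * V^(gamma - 1) stays constant.
gamma <- 5 / 3
T2_adi <- T1 * (V1 / V2)^(gamma - 1)
p2_adi <- pressure(V2, T2_adi)

# Both end states satisfy (2.2), yet p and T differ between the two setups.
rbind(isothermal = c(p = p2_iso, temp = T2_iso),
      adiabatic  = c(p = p2_adi, temp = T2_adi))
```

In both cases the same variable V is set to the same value, but the equation (2.2) alone does not tell us which of the two end states results; that information lies in the physical setup of the intervention.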


2.3.3 Cyclic Assignments 


We think of SCMs as abstractions of underlying processes that take place in time. 
For these underlying processes, there is no problem with feedback loops, since at 
a sufficiently fast time scale, those loops will be unfolded in time, assuming there 
are no instantaneous interactions, which are arguably excluded by the finiteness of 
the speed of light. 

Even though the time-dependent processes do not have cycles, it is possible that 
an SCM derived from such processes (for instance, by methods mentioned below 
in Remarks 6.5 and 6.7), involving only quantities that no longer depend on time, 
does have cycles. It becomes a little harder to define general interventions in such 
systems, but certain types of interventions should still be doable. For instance,
a hard intervention where we set the value of one variable to a fixed value may 
be possible (and realizable physically by a forcing term in an underlying set of 
differential equations; see Remark 6.7). This cuts the cycle, and we can then derive 
the entailed intervention distribution. 

However, it may be impossible to derive an entailed observational distribution
from a cyclic set of structural assignments. Let us consider the two assignments

$X := f_X(Y, N_X)$
$Y := f_Y(X, N_Y)$

and noise variables $N_X \perp\!\!\!\perp N_Y$. Just like in the case of acyclic models, we consider
the noises and functions as given and seek to compute the entailed joint distribution
of X and Y. To this end, let us start with the first assignment $X := f_X(Y, N_X)$, and
substitute some initial Y into it. This yields an X, which we can then substitute
into the other assignment. Suppose we iterate the two assignments and converge
to some fixed point. This point would then correspond to a joint distribution of
X,Y simultaneously satisfying both structural assignments as equalities of random
variables.¹² Note that we have here assumed that the same $N_X$, $N_Y$ are used at every
step, rather than independent copies thereof.

However, such an equilibrium for X,Y need not always exist, and even if it does, 
it need not be the case that it can be found using the iteration. In the linear case, 
this has been analyzed by Lacerda et al. [2008] and Hyttinen et al. [2012]; see also 
Lauritzen and Richardson [2002]. For further details see Remark 6.5. 
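
The iteration described above can be tried out numerically. The following sketch assumes a linear cyclic model $X := a \cdot Y + N_X$, $Y := b \cdot X + N_Y$ with standard normal noises; the coefficients, the sample size, and the helper names `iterate` and `equilibrium_X` are illustrative choices, not part of the text. Solving the two linear equations directly gives $X = (N_X + a N_Y)/(1 - ab)$ whenever $ab \neq 1$.

```r
set.seed(1)
n <- 1000
NX <- rnorm(n)  # the same noise values are reused in every iteration
NY <- rnorm(n)

# Iterate X := a*Y + NX, Y := b*X + NY, starting from Y = 0.
iterate <- function(a, b, steps = 50) {
  Y <- rep(0, n)
  for (i in seq_len(steps)) {
    X <- a * Y + NX
    Y <- b * X + NY
  }
  list(X = X, Y = Y)
}

# Equilibrium obtained by solving the two linear equations directly
# (it exists whenever a * b != 1).
equilibrium_X <- function(a, b) (NX + a * NY) / (1 - a * b)

# Case 1: a * b = 0.25, the iteration converges to the equilibrium.
fit <- iterate(0.5, 0.5)
max(abs(fit$X - equilibrium_X(0.5, 0.5)))

# Case 2: a * b = 4, an equilibrium exists but the iteration moves away from it.
div <- iterate(2, 2)
max(abs(div$X - equilibrium_X(2, 2)))
```

The first maximum is numerically zero, whereas the second grows with the number of steps: an equilibrium solving both assignments exists in the second case, but the iteration does not find it.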

This observation that one may not always be able to get an entailed distribution 
satisfying two cyclic structural assignments is consistent with the view of SCMs as 
abstractions of underlying physical processes — abstractions whose domain of va- 
lidity as causal models is limited. If we want to understand general cyclic systems, 
it may be unavoidable to study systems of differential equations rather than SCMs. 
For certain restricted settings, on the other hand, it can still make sense to stay on 
the phenomenologically more superficial level of SCMs; see, for example, Mooij 
et al. [2013]. One may speculate that this difficulty inherent to SCMs (or SEMs) is
part of the reason why the econometrics community started off viewing SEMs as
causal models, but later on parts of the community decided to forgo this interpretation
in favor of a view of structural equations as purely algebraic equations.

¹² The fact that the assignments are satisfied as equalities of random variables means that we are
considering an ensemble of systems that differ in the realizations of the noise variables. Each realization
leads to a (possibly different) realization for X,Y, and thus the distribution of the noises implies
a distribution over X,Y.


2.3.4 Feasibility of Interventions 


We have used the principle of independent mechanisms to motivate interventions 
that only affect one mechanism (or structural assignment) at a time. While real 
systems may admit such kind of interventions, there will also be interventions that 
replace several assignments at the same time. The former type of interventions 
may be considered more elementary in an intuitive physical sense. If multiple 
elementary interventions are combined, then this may in principle happen in a way 
such that they are tuned to each other, and we would view this as violating a form of
our independence Principle 2.1; see footnote 8 on page 24. One may hope that 
combined interventions that are “natural” will not violate independence. However, 
to tell whether an intervention is “natural” in this sense requires knowledge of 
the causal structure, which we do not have when trying to use such principles 
to perform causal learning in the first place. Ultimately, one can try to resort to 
physics to assay what is elementary or natural. 

The question of which operations on a physical system are elementary plays a
crucial role in modern quantum information theory. There, the question is closely
related to analyzing the structure of physical interactions.¹³ Likewise, we believe
that understanding physical mechanisms underlying causal relations may some- 
times explain why some interventions are natural and others are complex, which 
essentially defines the “modules” given by the different structural equations. 


2.3.5 Independence of Cause and Mechanism and the 
Thermodynamic Arrow of Time 


¹³ For the interested reader: A system consisting of n two-level quantum systems is described by
the $2^n$-dimensional Hilbert space $\mathbb{C}^2 \otimes \cdots \otimes \mathbb{C}^2$. Unitary operators acting on this Hilbert space correspond
to physical processes. For several such systems, researchers have shown how to implement
“basic” unitaries that act on at most two of the n tensor components [Nielsen and Chuang, 2000] and
act trivially on the remaining $n - 2$ ones. Then one can generate any other unitary [DiVincenzo, 1995]
approximately by concatenation. Although this is by no means the only possible choice for the set
of “basic” unitary operations, the choice seems natural given the structure of physical interactions.

Figure 2.4: Simple example of the independence of initial state and dynamical law: beam
of particles that are scattered at an object. The outgoing particles contain information about
the object while the incoming do not.

We provide a discussion as well as a toy model illustrating how the principle of
independent mechanisms can be viewed as a principle of physics. To this end, we
consider the special case of two variables and postulate the following as a specialization
of Principle 2.1:


Principle 2.2 (Initial state and dynamical law) If s is the initial state of a physical
system and M a map describing the effect of applying the system dynamics for
some fixed time, then s and M are independent. Here, we assume that the initial 
state, by definition, is a state that has not interacted with the dynamics before. 


Here, the “initial” state s and “final” state M(s) are considered as “cause” and 
“effect.” Accordingly, M is the mechanism relating cause and effect. The last sen- 
tence of Principle 2.2 requires some explanation to avoid erroneous conclusions. 
We now discuss its meaning for an intuitive example. 

Figure 2.4 shows a scenario where the independence of initial state and dynamics 
is so natural that we take it for granted: a beam of n particles propagating in exactly 
the same direction are approaching some object, where they are scattered in various 
directions. The directions of the outgoing particles contain information about the 
object, while the beam of incoming particles does not contain information about it. 
The assumption that the particles initially propagate exactly in the same direction 
can certainly be weakened. Even if there is some disorder in the incoming beam, 
the outgoing beam can still contain information about the object. Indeed, vision 
and photography are only possible because photons contain information about the 
objects at which they were scattered. 

We can easily time-reverse the scenario by “hand-designing” an incoming beam 
for which all particles propagate in the same direction after the scattering process. 
We now argue how to make sense of Principle 2.2 in this case. Certainly, such a
beam can only be prepared by a machine or a subject that is aware of the object’s
shape and then directs the particles accordingly. As a matter of fact, particles that 
have never been in contact with the object cannot a priori contain information about 
it. Then, Principle 2.2 can be maintained if we consider the process of directing 
the particles as part of the mechanism and reject the idea of calling the state of the 
hand-designed beam an initial state. Instead, the initial state then refers to the time 
instant before the particles have been given the fine-tuned momenta. 

The fact that photographic images show what has happened in the past and not 
what will happen in the future is among the most evident asymmetries between past 
and future. The preceding discussion shows that this asymmetry can be seen as an 
implication of Principle 2.2. The principle thus links asymmetries between cause 
and effect with asymmetries between past and future that we take for granted. 

After having explained the relation between Principle 2.1 and the asymmetry 
between past and future in physics on an informal level, we briefly mention that 
this link has been made more formally by Janzing et al. [2016] using algorithmic 
information theory. In the same way as Principle 4.13 formalizes independence 
of $P_C$ and $P_{E|C}$ as algorithmic independence, Principle 2.2 can also be interpreted
as algorithmic independence of s and M. Janzing et al. [2016, Theorem 1] show 
that for any bijective M, Principle 2.2 then implies that the physical entropy of 
M(s) cannot be smaller than the entropy of s (up to an additive constant) provided 
that one is willing to accept Kolmogorov complexity (see Section 4.1.9) as the 
right formalization of physical entropy, as proposed by Bennett [1982] and Zurek 
[1989]. Principle 2.2 thus implies non-decrease of entropy in the sense of the 
standard arrow of time in physics. 


3 


Cause-Effect Models 


The present chapter formalizes some basic concepts of causality for the case where 
the causal models contain only two variables. Assuming these two variables are
non-trivially related and their dependence is not solely due to a common cause, 
this constitutes a cause-effect model. We briefly introduce SCMs, interventions, 
and counterfactuals. All of these concepts are defined again in the context of mul- 
tivariate causal models (Chapter 6) and we hope that encountering them for two 
variables first makes the ideas more easily accessible. 


3.1 Structural Causal Models 


SCMs constitute an important tool to relate causal and probabilistic statements. 


Definition 3.1 (Structural causal models) An SCM $\mathfrak{C}$ with graph $C \to E$ consists
of two assignments

$C := N_C$, (3.1)
$E := f_E(C, N_E)$, (3.2)

where $N_E \perp\!\!\!\perp N_C$, that is, $N_E$ is independent of $N_C$.


In this model, we call the random variables C the cause and E the effect. Furthermore,
we call C a direct cause of E, and we refer to $C \to E$ as a causal graph.
This notation hopefully clarifies and coincides with the reader’s intuition when we
talk about interventions, for example, in Example 3.2.

If we are given both the function $f_E$ and the noise distributions $P_{N_C}$ and $P_{N_E}$, we
can sample data from such a model in the following way: We sample noise values
$N_E$, $N_C$ and then evaluate (3.1) followed by (3.2). The SCM thus entails a joint
distribution $P_{C,E}$ over C and E (for a formal proof see Proposition 6.3).


3.2 Interventions 


As discussed in Section 1.4.2, we are often interested in the system’s behavior 
under an intervention. The intervened system induces another distribution, which 
usually differs from the observational distribution. If any type of intervention can 
lead to an arbitrary change of the system, these two distributions become unrelated 
and instead of studying the two systems jointly we may consider them as two sep- 
arate systems. This motivates the idea that after an intervention only parts of the 
data-generating process change. For example, we may be interested in a situation in 
which variable E is set to the value 4 (irrespective of the value of C) without chang- 
ing the mechanism (3.1) that generates C. That is, we replace the assignment (3.2) 
by E := 4. This is called a (hard) intervention and is denoted by $do(E := 4)$. The
modified SCM, where (3.2) is replaced, entails a distribution over C that we denote
by $P_C^{do(E:=4)}$ or $P_C^{\mathfrak{C};do(E:=4)}$, where the latter makes explicit that the SCM $\mathfrak{C}$ was
our starting point. The corresponding density is denoted by $c \mapsto p^{do(E:=4)}(c)$ or, in
slight abuse of notation, $p^{do(E:=4)}(c)$.¹ However, manipulations can be much more
general. For example, the intervention $do(E := g_E(C) + \tilde{N}_E)$ keeps a functional
dependence on C but changes the noise distribution. This is an example of a soft
intervention. We can replace either of the two equations.
The following example motivates the namings “cause” and “effect”: 


Example 3.2 (Cause-effect interventions) Suppose that the distribution $P_{C,E}$ is
entailed by an SCM $\mathfrak{C}$

$C := N_C$
$E := 4 \cdot C + N_E$, (3.3)

with $N_C, N_E \overset{\text{iid}}{\sim} \mathcal{N}(0,1)$, and graph $C \to E$. Then,

$P_E^{\mathfrak{C}} = \mathcal{N}(0,17) \neq \mathcal{N}(8,1) = P_E^{\mathfrak{C};do(C:=2)} = P_{E|C=2}^{\mathfrak{C}}$
$\neq \mathcal{N}(12,1) = P_E^{\mathfrak{C};do(C:=3)} = P_{E|C=3}^{\mathfrak{C}}$.

¹ In the literature, the notation $p(c \mid do(E := 4))$ is also commonly used. We prefer $p^{do(E:=4)}$ since
interventions are conceptually different from conditioning, and $p(c \mid do(E := 4))$ resembles the usual
notation for the latter, $p(c \mid E = 4)$.

Intervening on C changes the distribution of E. But on the other hand,

$P_C^{\mathfrak{C};do(E:=2)} = \mathcal{N}(0,1) = P_C^{\mathfrak{C}} = P_C^{\mathfrak{C};do(E:=314159265)} \; (\neq P_{C|E=2}^{\mathfrak{C}})$. (3.4)


No matter how strongly we intervene on E, the distribution of C remains what it 
was before. This model behavior corresponds well to our intuition of C “caus- 
ing” E: for example, no matter how much we whiten someone’s teeth, this will not 
have any effect on this person’s smoking habits. (Importantly, the conditional dis- 
tribution of C given E = 2 is different from the distribution of C after intervening 
and setting E to 2.) 

The asymmetry between cause and effect can also be formulated as an independence
statement. When we replace the assignment (3.3) with $E := \tilde{N}_E$ (think about
randomizing E), we break the dependence between C and E. In

$P_{C,E}^{\mathfrak{C};do(E:=\tilde{N}_E)}$

we find $C \perp\!\!\!\perp E$. This independence does not hold when randomizing C. As long as
$\mathrm{var}[\tilde{N}_C] \neq 0$, we find $C \not\perp\!\!\!\perp E$ in

$P_{C,E}^{\mathfrak{C};do(C:=\tilde{N}_C)}$;

the correlation between C and E remains non-zero.


Code Snippet 3.3 The code samples from the SCM described in Example 3.2.

set.seed(1)
# generates a sample from the distribution entailed by the SCM
C <- rnorm(300)
E <- 4*C + rnorm(300)
c(mean(E), var(E))
# [1] 0.1236532 16.1386767
#
# generates a sample from the intervention distribution do(C:=2);
# this changes the distribution of E
C <- rep(2,300)
E <- 4*C + rnorm(300)
c(mean(E), var(E))
# [1] 7.936917 1.187035
#
# generates a sample from the intervention distribution do(E:=N~);
# this breaks the dependence between C and E
C <- rnorm(300)
E <- rnorm(300)
cor.test(C,E)$p.value
# [1] 0.2114492
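
The remark in Example 3.2 that conditioning on E = 2 differs from intervening with do(E:=2) can be checked in the same style. The following is a sketch only: the sample size, the window width 0.05 used to approximate the event E = 2, and the names `sel`, `C.do`, and `E.do` are illustrative assumptions. The theoretical values quoted in the comments follow from the joint Gaussian distribution of (C, E).

```r
set.seed(1)
n <- 1e6

# observational sample from the SCM of Example 3.2
C <- rnorm(n)
E <- 4*C + rnorm(n)

# conditioning: keep only the samples whose E is close to 2
# (the window width 0.05 is an arbitrary choice)
sel <- abs(E - 2) < 0.05
c(mean(C[sel]), var(C[sel]))
# theory: mean 8/17 (about 0.47) and variance 1/17 (about 0.059)

# intervening: do(E:=2) replaces the assignment for E and leaves C untouched
C.do <- rnorm(n)
E.do <- rep(2, n)
c(mean(C.do), var(C.do))
# theory: mean 0 and variance 1
```

Selecting samples by their value of E shifts and sharpens the distribution of C, while setting E by intervention leaves the distribution of C unchanged.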




3.3 Counterfactuals 


Another possible modification of an SCM changes all of its noise distributions. 
Such a change can be induced by observations and allows us to answer counter- 
factual questions. To illustrate this, imagine the following hypothetical scenario: 


Example 3.4 (Eye disease) There exists a rather effective treatment for an eye 
disease. For 99% of all patients, the treatment works and the patient gets cured (B = 
0); if untreated, these patients turn blind within a day (B = 1). For the remaining 
1%, the treatment has the opposite effect and they turn blind (B = 1) within a day. 
If untreated, they regain normal vision (B = 0). 

Which category a patient belongs to is controlled by a rare condition ($N_B = 1$)
that is unknown to the doctor, whose decision whether to administer the treatment
(T = 1) is thus independent of $N_B$. We write it as a noise variable $N_T$.

Assume the underlying SCM

$\mathfrak{C}$: $T := N_T$
$B := T \cdot N_B + (1 - T) \cdot (1 - N_B)$ (3.5)

with Bernoulli distributed $N_B \sim \mathrm{Ber}(0.01)$; note that the corresponding causal
graph is $T \to B$.

Now imagine a specific patient with poor eyesight comes to the hospital and goes 
blind (B = 1) after the doctor administers the treatment (T = 1). We can now ask 
the counterfactual question “What would have happened had the doctor admin- 
istered treatment T = 0?” Surprisingly, this can be answered. The observation 
B = T = 1 implies with (3.5) that for the given patient, we had $N_B = 1$. This, in
turn, lets us calculate the effect of $do(T := 0)$.

To this end, we first condition on our observation to update the distribution over
the noise variables. As we have seen, conditioned on B = T = 1, the distribution
for $N_B$ and the one for $N_T$ each collapse to a point mass on 1, that is, $\delta_1$. This leads to
a modified SCM:

$\mathfrak{C}|B=1,T=1$: $T := 1$
$B := T \cdot 1 + (1 - T) \cdot (1 - 1) = T$ (3.6)

Note that we only update the noise distributions; conditioning does not change the 
structure of the assignments themselves. The idea is that the physical mechanisms 
are unchanged (in our case, what leads to a cure and what leads to blindness), but 
we have gleaned knowledge about the previously unknown noise variables for the 
given patient. 
Next, we calculate the effect of $do(T := 0)$ for this patient:

$\mathfrak{C}|B=1,T=1; do(T:=0)$: $T := 0$
$B := T$ (3.7)

Clearly, the entailed distribution puts all mass on (0,0), and hence

$P^{\mathfrak{C}|B=1,T=1;do(T:=0)}(B = 0) = 1$.

This means that the patient would have been cured (B = 0) if the doctor had
not given him treatment, in other words, $do(T := 0)$. Because of

$P^{\mathfrak{C};do(T:=1)}(B = 0) = 0.99$ and
$P^{\mathfrak{C};do(T:=0)}(B = 0) = 0.01$,


however, we can still argue that the doctor acted optimally (according to the avail- 
able knowledge). 
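
The three steps used in Example 3.4 (update the noise distribution given the observation, apply the intervention, predict) can be traced by enumerating the four possible noise values. The sketch below assumes the assignment $B := T \cdot N_B + (1-T) \cdot (1-N_B)$ from (3.5); the prior P(N_T = 1) = 0.5 is a hypothetical choice, since the example only specifies $N_B \sim \mathrm{Ber}(0.01)$, and all variable names are illustrative.

```r
# structural assignment for B from (3.5)
f_B <- function(treat, N_B) treat * N_B + (1 - treat) * (1 - N_B)

# prior over the noise variables; P(N_T = 1) = 0.5 is an assumption,
# the example only specifies N_B ~ Ber(0.01)
noise <- expand.grid(N_T = 0:1, N_B = 0:1)
noise$prob <- dbinom(noise$N_T, 1, 0.5) * dbinom(noise$N_B, 1, 0.01)

# step 1: condition on the observation T = 1, B = 1
T_obs <- noise$N_T
B_obs <- f_B(T_obs, noise$N_B)
keep <- T_obs == 1 & B_obs == 1
posterior <- noise[keep, ]
posterior$prob <- posterior$prob / sum(posterior$prob)

# step 2: intervene with do(T := 0), keeping the assignment for B
# step 3: predict B under the updated noise distribution
B_cf <- f_B(0, posterior$N_B)
p_cured_cf <- sum(posterior$prob[B_cf == 0])
p_cured_cf
# [1] 1

# for comparison: interventional probabilities of a cure
p_cured_do1 <- sum(noise$prob[f_B(1, noise$N_B) == 0])
p_cured_do0 <- sum(noise$prob[f_B(0, noise$N_B) == 0])
c(p_cured_do1, p_cured_do0)
# [1] 0.99 0.01
```

The counterfactual probability depends on the observed patient through the updated noise, whereas the two interventional probabilities refer to the whole population; neither depends on the assumed prior for $N_T$.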


Interestingly, Example 3.4 shows that we can use counterfactual statements to 
falsify the underlying causal model (see Section 6.8). Imagine that the rare condition
$N_B$ can be tested, but the test results take longer than a day. In this case,
it is possible that we observe a counterfactual statement that contradicts the measurement
result for $N_B$. The same argument is given by Pearl [2009, p.220, point
(2)]. Since the scientific content of counterfactuals has been debated extensively, it 
should be emphasized that the counterfactual statement here is falsifiable because 
the noise variable is not unobservable in principle but only at the moment when the 
decision of the doctor has to be made. 


3.4 Canonical Representation of Structural Causal 
Models 


We have discussed two types of causal statements both entailed by SCMs: first,
the behavior of the system under potential interventions, and second, counterfactual
statements. To further understand the difference between them, we introduce
the following “canonical representation” of an SCM.² According to the structural
assignment

$E = f_E(C, N_E)$,

for each fixed value $n_E$ of the noise $N_E$, E is a deterministic function of C:

$E = f_E(C, n_E)$. (3.8)

In other words, if C and E attain values in $\mathcal{C}$ and $\mathcal{E}$, respectively, then the noise $N_E$
switches between different functions from $\mathcal{C}$ to $\mathcal{E}$. Without loss of generality, we
may therefore assume that $N_E$ attains values in the set of functions from $\mathcal{C}$ to $\mathcal{E}$,
denoted by $\mathcal{E}^{\mathcal{C}}$. Using this convention, we can also rewrite (3.8) as

$E = n_E(C)$, (3.9)

and call this the canonical representation of the structural equation relating C and E.

² This representation has been used in the literature in various places, for example, [Pearl, 2009],
although we have not found the term “canonical representation.”

Let us now explain why two SCMs with different canonical representations may
induce the same interventional probabilities, although they differ in their counterfactual
statements. To this end, we restrict the attention to the case where C attains
values in the finite set $\mathcal{C} = \{1, \ldots, k\}$. Then the set of functions from $\mathcal{C}$ to $\mathcal{E}$ is given
by the k-fold Cartesian product

$\mathcal{E}^k := \underbrace{\mathcal{E} \times \cdots \times \mathcal{E}}_{k \text{ times}}$,

where the jth component describes which value E attains for C = j. Accordingly,
the distribution $P_{N_E}$ is given by a joint distribution on $\mathcal{E}^k$ whose marginal distribution
of the jth component determines the conditional $P_{E|C=j}$. Since C is the
cause and E the effect, we have $P_E^{do(C:=j)} = P_{E|C=j}$; in other words, here interventional
probabilities and observational conditional probabilities coincide. Thus, the
interventional causal implications of the SCM are completely determined by the
marginal distributions of each component of the vector-valued noise variable $N_E$
even though the SCM includes a precise specification of $P_{N_E}$, that is, the joint distribution
of all components. While the statistical dependences between the components
of the noise variable $N_E$ referring to the effect are irrelevant for interventional
causal statements, they do matter for counterfactual statements. To see this, let C
and E be binary, that is, $\mathcal{C} = \mathcal{E} = \{0,1\}$. The set of functions from {0,1} to {0,1}
reads $\mathcal{E}^{\mathcal{C}} = \{0, 1, \mathrm{ID}, \mathrm{NOT}\}$, where 0, 1 denote the constant functions attaining 0
and 1, respectively, and ID and NOT denote identity and negation, respectively.
To construct two different distributions $P_{N_E}$ and $P'_{N_E}$ inducing the same conditionals
$P_{E|C=0}$, $P_{E|C=1}$, first choose the uniform mixture of 0 and 1 and second the uniform
mixture of ID and NOT. In both cases, C and E are statistically independent and
the distribution of E is unaffected by interventions on C because E remains an unbiased
coin toss regardless of C. In the Cartesian product representation, the four
functions read $\mathcal{E}^{\mathcal{C}} = \{(0,0), (1,1), (0,1), (1,0)\}$, where the first and the second component
denote the images of C = 0 and C = 1, respectively. Obviously, the uniform
mixture of (0,0) and (1,1) and the uniform mixture of (0,1) and (1,0) both induce
the same marginal distributions on the first and the second component of the
Cartesian product, in agreement with our remark that they induce the same intervention
distributions. The counterfactual statement “E would have attained a
different value if C had been set to a different one,” however, is true only for the
mixture of ID and NOT, but not for the mixture of 0 and 1. Hence, counterfactual
statements depend not only on the marginal distributions of the components of the
noise variable $N_E$, but also on the statistical dependences between the Cartesian
product components.
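
The binary example can be verified by direct enumeration. In the sketch below, each of the four functions from {0,1} to {0,1} is stored as the pair (value at C = 0, value at C = 1), and the two noise distributions are the uniform mixtures discussed above. The names `funs`, `P_const`, `P_flip`, `p_do`, and `p_change` are illustrative and not taken from the text.

```r
# the four functions from {0,1} to {0,1}, stored as (value at C=0, value at C=1)
funs <- list(zero = c(0, 0), one = c(1, 1), ID = c(0, 1), NOT = c(1, 0))

# two noise distributions over these functions
P_const <- c(zero = 0.5, one = 0.5, ID = 0,   NOT = 0)
P_flip  <- c(zero = 0,   one = 0,   ID = 0.5, NOT = 0.5)

# interventional probability P(E = 1) under do(C := c_val)
p_do <- function(P, c_val) {
  sum(sapply(names(P), function(f) P[[f]] * funs[[f]][c_val + 1]))
}

# counterfactual probability that E would have differed had C been set to
# the other value (the noise, i.e. the selected function, stays fixed)
p_change <- function(P) {
  sum(sapply(names(P), function(f) P[[f]] * (funs[[f]][1] != funs[[f]][2])))
}

# identical interventional distributions ...
c(p_do(P_const, 0), p_do(P_const, 1), p_do(P_flip, 0), p_do(P_flip, 1))
# [1] 0.5 0.5 0.5 0.5

# ... but opposite counterfactual statements
c(p_change(P_const), p_change(P_flip))
# [1] 0 1
```

Only the marginals of the two components enter `p_do`, whereas `p_change` depends on how the two components are coupled.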

Note that two formally different SCMs may induce not only the same interventional
distribution but even imply the same counterfactual statements: Given the
assignment

$E := f_E(C, N_E)$,

reparameterizations of $N_E$ are obviously irrelevant. More explicitly, we may set

$E := \tilde{f}_E(C, \tilde{N}_E) := f_E(C, g^{-1}(\tilde{N}_E))$,

for some bijection g on the range of $N_E$ and redefine the noise variable by $\tilde{N}_E := g(N_E)$.
Using the canonical representation (3.9), we got rid of this additional degree
of freedom that would have confused this discussion of counterfactuals.


3.5 Problems


Problem 3.5 (Sampling from an SCM) Consider the SCM

$X := Y^2 + N_X$ (3.10)
$Y := N_Y$ (3.11)

with $N_X, N_Y \overset{\text{iid}}{\sim} \mathcal{N}(0,1)$. Generate an i.i.d. sample of size 200 from the joint distribution
of (X,Y).


Problem 3.6 (Conditional distributions) Show that $P_{C|E=2}^{\mathfrak{C}}$ in Equation (3.4) is
a Gaussian distribution:

$C \mid E = 2 \sim \mathcal{N}\left(\frac{8}{17}, \frac{1}{17}\right)$.
Problem 3.7 (Interventions) Assume that we know that a process either follows
the SCM

$X := Y + N_X$
$Y := N_Y$,

where $N_X \sim \mathcal{N}(\mu_X, \sigma_X^2)$ and $N_Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$ with unknown $\mu_X, \mu_Y$ and $\sigma_X, \sigma_Y > 0$,
or it follows the SCM

$X := M_X$
$Y := X + M_Y$,

where $M_X \sim \mathcal{N}(\nu_X, \tau_X^2)$ and $M_Y \sim \mathcal{N}(\nu_Y, \tau_Y^2)$ with unknown $\nu_X, \nu_Y$ and $\tau_X, \tau_Y > 0$.
Is there a single intervention distribution that lets you distinguish between the two
SCMs?


Problem 3.8 (Cyclic SCMs) We have mentioned that if the assignments inherit
a cyclic structure, the SCM does not necessarily induce a unique distribution over
the observed variables. Sometimes there is no solution and sometimes it is not
unique.


a) We first look at an example that induces a unique solution. Consider the
SCM

$X := 2 \cdot Y + N_X$ (3.12)
$Y := 2 \cdot X + N_Y$ (3.13)

with $(N_X, N_Y) \sim P$ for an arbitrary distribution P. Compute $\alpha, \beta, \gamma, \delta$ such
that

$X := \alpha N_X + \beta N_Y$
$Y := \gamma N_X + \delta N_Y$

yields a solution $(X, Y, N_X, N_Y)$ of the SCM; that is, the vector satisfies Equations
(3.12) and (3.13). The solution can be seen as a special case of Equation (6.2).


b) Consider the SCM

$X := Y + N_X$
$Y := X + N_Y$

with $(N_X, N_Y) \sim P$. Show that if P allows for a density with respect to
Lebesgue measure and factorizes, that is, $N_X \perp\!\!\!\perp N_Y$, then there is no solution
$(X, Y, N_X, N_Y)$ of the SCM.

Furthermore, construct a distribution P, and a vector $(X, Y, N_X, N_Y)$ that
solves the SCM.


4 


Learning Cause-Effect Models 


Readers who are familiar with the conditional statistical independence-based ap- 
proach to causal discovery from observational data [Pearl, 2009, Spirtes et al., 
2000] may be surprised by a chapter discussing causal inference for the case of 
only two observed variables, that is, a case where no non-trivial conditional in- 
dependences can hold. This chapter introduces assumptions under which causal 
inference with just two observed variables is possible. 

Some of these assumptions may seem too strong to be realistic, but one should 
keep in mind that empirical inference, even if it is not concerned with causal prob- 
lems, requires strong assumptions. This is true in particular when it deals with 
high-dimensional data and low sample sizes. Therefore, oversimplified models are 
ubiquitous and they have proven helpful in many learning scenarios.

The list of assumptions is diverse and we are certain that it is incomplete, too. 
Current research is still in a phase of exploring the enormous space of assump- 
tions that yield identifiability between cause and effect. We hope that this chapter 
inspires the reader who may then add other — hopefully realistic — assumptions 
that can be used for learning causal structures. 

We provide the assumptions and theoretical identifiability results in Section 4.1; 
Section 4.2 shows how these results can be used for structure identification in the 
case of a finite amount of data. 
4.1 Structure Identifiability


4.1.1 Why Additional Assumptions Are Required 


In Chapter 3, we introduced SCMs where the effect E is computed from the cause C
using a function assignment. One may wonder whether this asymmetry of the data-generating
process (i.e., that E is computed from C and not vice versa) becomes
apparent from looking at $P_{C,E}$ alone. That is, does the joint distribution $P_{X,Y}$ of two
variables X,Y tell us whether it has been induced by an SCM from X to Y or from
Y to X? In other words, is the structure identifiable from the joint distribution?
The following known result shows that the answer is “no” if one allows for general
SCMs.


Proposition 4.1 (Non-uniqueness of graph structures) For every joint distribution
$P_{X,Y}$ of two real-valued variables, there is an SCM

$Y = f_Y(X, N_Y)$, $X \perp\!\!\!\perp N_Y$,

where $f_Y$ is a measurable function and $N_Y$ is a real-valued noise variable.


Proof. Analogously to Peters [2012, Proof of Proposition 2.6], define the conditional
cumulative distribution function

$F_{Y|x}(y) := P(Y \leq y \mid X = x)$.

Then define

$f_Y(x, n_Y) := F_{Y|x}^{-1}(n_Y)$,

where $F_{Y|x}^{-1}(n_Y) := \inf\{y \in \mathbb{R} : F_{Y|x}(y) \geq n_Y\}$. Then, let $N_Y$ be uniformly distributed
on [0,1] and independent of X.
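
The construction in the proof can be tried on the linear Gaussian model of Example 3.2, read in the anticausal direction with X = E and Y = C. This is a sketch under stated assumptions: for that model, C given E = e is Gaussian with mean 4e/17 and variance 1/17, so the inverse conditional distribution function is available through `qnorm`; the sample size and the names `f_Y`, `E.back`, `N.back`, and `C.back` are illustrative.

```r
set.seed(1)
n <- 1e5

# forward SCM from Example 3.2: C := N_C, E := 4*C + N_E
C <- rnorm(n)
E <- 4*C + rnorm(n)

# backward SCM: X = E plays the role of the cause, Y = C that of the effect.
# For this Gaussian model, C given E = e is N(4e/17, 1/17), so the inverse
# conditional CDF is available in closed form via qnorm.
f_Y <- function(x, n_Y) qnorm(n_Y, mean = 4*x/17, sd = sqrt(1/17))

E.back <- rnorm(n, mean = 0, sd = sqrt(17))  # marginal distribution of E
N.back <- runif(n)                           # uniform noise, independent of E
C.back <- f_Y(E.back, N.back)

# both samples should show approximately the same second moments:
# var(C) = 1, var(E) = 17, cov(C, E) = 4
rbind(forward  = c(var(C),      var(E),      cov(C, E)),
      backward = c(var(C.back), var(E.back), cov(C.back, E.back)))
```

Both rows agree up to sampling error, so the two SCMs cannot be told apart from this joint distribution alone, although they make different statements about interventions.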


The result can be applied to the case X = C and Y = E as well as to the case
X = E and Y = C, thus every joint distribution $P_{X,Y}$ admits SCMs in both directions.
For this reason, it is often thought that the causal direction between just two
observed variables cannot be inferred from passive observations alone. We will
see in Chapter 7 that this claim fits into a framework in which causal inference is
based on (conditional) statistical independences only [Spirtes et al., 2000, Pearl,
2009]. Then, the causal structures $X \to Y$ and $Y \to X$ are indistinguishable. For
just two variables, the only possible (conditional) independence would condition
on the empty set, which does not render X and Y independent unless the causal
influence is non-generic.¹ More recently, this perspective has been challenged by
approaches that also use information about the joint distribution other than conditional
independences. These approaches rely on additional assumptions about the
relations between probability distributions and causality.

The remaining part of Section 4.1 discusses under which assumptions the graph 
structure can be recovered from the joint distribution (structure identifiability). 
Section 4.2 then describes methods that estimate the graph from a finite data set 
(structure identification). These statistical methods do not need to be motivated by 
the proofs of the identifiability results. Methods that follow the proofs closely are 
often inefficient in making use of the data. 


4.1.2 Overview of the Type of Assumptions 


A Priori Restriction of the Model Class One possible approach to distinguish
cause and effect is to define a class of “particularly natural” conditionals² $P_{E|C}$
and marginals $P_C$. For several such classes, there are theoretical results showing
that “generic” combinations of marginals $P_X$ and conditionals $P_{Y|X}$ induce joint
distributions that cannot be described by the same class when $X$ and $Y$ are swapped.
Statements of this kind are also called identifiability results and we will see such
examples in the remainder of Section 4.1.

For example, one may define classes of conditionals $P_{E|C}$ and marginals $P_C$ by
restricting the class of functions $f_E$; see (3.2), and/or the class of noise
distributions in (3.1) and (3.2), as will be discussed in Sections 4.1.3-4.1.6. This approach
seems particularly natural from a machine learning perspective, where restricting
the complexity of functions appears everywhere in standard tasks such as
regression and classification. Note that inferring causal directions via restricted function
classes implicitly assumes that the noise variables are still independent, in
agreement with the definition of an SCM (see Definition 3.1). In this sense, one could
say that these methods employ the independence of noise according to Figure 2.2,
but keep in mind that independence of noise renders causal directions identifiable
only after restricting the function class (see Proposition 4.1).

Other classes of this kind can be found in Sun et al. [2006], Janzing et al. [2009b],
and Comley and Dowe [2003]. Sun et al. [2006] and Janzing et al. [2009b], for
instance, consider second-order exponential models, for which the logarithmic
densities of $P_{E|C}$ and $P_C$ are second-order polynomials in $e$ and $c$ (up to a partition
function), or in $c$, respectively.


¹ Note that this non-generic case should not be called “trivial” because non-trivial counterfactual
influence can be consistent with $X \perp\!\!\!\perp Y$ (see Section 3.4).

² We use the notation $P_{E|C}$ as a shorthand for the collection $(P_{E|C=c})_c$ of conditional distributions
and implicitly assume the existence of a density, in other words, that $P_{E,C}$ is absolutely continuous
with respect to a product measure.

We conclude this part with two questions: First, how should one define model 
classes that describe a reasonable fraction of empirical data in real life? Second, 
given that an empirical distribution admits such a model in exactly one direction, 
why should this be the causal one? The first question is actually not specific to the 
problem of causal inference; constructing functions that describe relations between 
observed variables always requires us to fit functions from a “reasonable” class. 
The second question appears to be among the deepest problems concerning the 
relation between probability and causality. We are only able to give some intuitive 
and vague ideas, which now follow. 

We start by providing an intuitive motivation that is related to the reason why 
usual machine learning relies on restricted model classes. Whenever we find a 
model from a small function class that fits our limited number of data, we expect 
that the model will also fit future observations, as argued in Chapter 1. Hence, 
finding models from a small class that fit data is crucial for the ability to gen- 
eralize to future observations. Formally, learning causal models is substantially 
different from the usual learning scenario because it aims at inferring a model that 
describes the behavior of the system under interventions and not just observations 
taken from the same distribution. Therefore, there is no straightforward way to
adopt arguments from statistical learning theory to obtain a learning theory for
causal relations. Nevertheless, we believe that finding a model from a small class
suggests — up to some error probability — that the model will also hold under 
different background conditions. We further believe that models that hold under 
many different background conditions are more likely to be causal than models 
that just fit observations from a single data set (see “Different Environments” in 
Section 7.1.6). This way, cause-effect inference via restricting the model class is 
vaguely related to ideas from statistical learning theory although drawing the exact 
link has to be left to the future. The preceding informal arguments for using causal 
models from small classes should not be mistaken as stating that causal relations 
in nature are indeed simple. Whether or not we will often succeed
in fitting data with simple functions is a completely different question. We only
argue for the belief that if there is a simple function that fits the data, it is more
likely to also describe a causal relation. Furthermore, we will draw one connection
between restricted model classes and the independence of cause and mechanism
in Section 4.1.9. To be prepared for those quite formal derivations, we first
provide a rather unrealistic toy model that we consider more a metaphor than a serious
example.


Independence of Cause and Mechanism Section 2.1 describes the idea that $P_C$
and $P_{E|C}$ correspond to two independent mechanisms of nature. Therefore, they
typically contain no information about each other (cf. Principle 2.1 and the middle
box in Figure 2.2). Naturally, postulating that $P_C$ and $P_{E|C}$ are independent in the
sense that they do not contain information about each other raises the question
of what type of information is meant. There is no obvious sense in which the
postulate can be formalized by a condition that could be checked by a statistical
independence test. This is because we are talking about a scenario where one fixed
joint distribution $P_{C,E}$ is visible and not a collection of distributions in which we
could check whether the distribution of the hypothetical cause and the distribution
of the hypothetical effect, given the cause, change in a dependent way (this is
essentially the difference between the left and the middle boxes in Figure 2.2). To
translate the independence of cause and mechanism into the language of SCMs, we
assume that the distribution of the cause should be independent of the function and
the noise distribution representing the causal mechanism. Note that this is, again,
a priori, not a statement about statistical independence. Instead, it states that $f_E$
and $P_{N_E}$ contain no information about $P_C$ and vice versa. This fact can only be used
for causal inference if the independence is violated for all structural models that
describe $P_{C,E}$ from $E$ to $C$.

Sections 4.1.7 and 4.1.8 describe two toy scenarios for which well-defined
notions of independence versus dependence can be given. Finally, in Section 4.1.9,
we describe a formalization of independence of $P_C$ and $P_{E|C}$ that is applicable to
more general scenarios rather than being restricted to the simple toy scenarios in
Sections 4.1.7 and 4.1.8. Here, dependence is measured by means of algorithmic
mutual information, a concept that is based on description length in the sense of
Kolmogorov complexity. Since the latter is uncomputable, it should be considered
as a philosophical principle rather than a method. Its practical relevance is
two-fold. First, it may inspire the development of new methods and, second,
justifications of existing methods can be based on it. For instance, the independence
principle can justify inference methods based on an a priori restriction of the model
class; see Section 4.1.9 for a specific example. To get a rough intuition about how
independence is related to restricted model classes, consider a thought experiment
where $P_C$ is randomly chosen from a class of $k$ different marginal distributions.
Likewise, assume that $P_{E|C}$ is chosen from another class of $\ell$ different conditional
distributions. This induces $k \cdot \ell$ different joint distributions $P_{C,E}$. In the generic case
(unless the classes are defined in a rather special way), this yields $k \cdot \ell > k$ different
marginals $P_E$ and $k \cdot \ell > \ell$ different conditionals $P_{C|E}$. Hence, typical combinations
of $P_C$ and $P_{E|C}$ induce joint distributions $P_{E,C}$ for which the “backward marginal
and conditional” $P_E$ and $P_{C|E}$ will not be in the original classes and would require
larger model classes instead. In other words, no matter how large one chooses the
set of possible $P_C$ and $P_{E|C}$, the set of induced $P_{C|E}$ and $P_E$ is even larger. This
thought experiment is more like a metaphor because it is based on the naive picture
of randomly choosing from a finite set. Nevertheless, it motivates the belief that in
the causal direction, marginals and conditionals are more likely to admit a
description from an a priori chosen small set provided that the latter has been constructed
in a reasonable way.
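The counting argument of this thought experiment can be mimicked with a small simulation. This is only a sketch under assumptions that are not in the text: the classes are drawn at random from a Dirichlet distribution over three states, and the sizes `k` and `l` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
k, l, n_states = 4, 5, 3          # hypothetical class sizes and state-space size

# k random marginals P_C and l random conditionals P_{E|C} (rows: c, columns: e).
marginals = rng.dirichlet(np.ones(n_states), size=k)
conditionals = rng.dirichlet(np.ones(n_states), size=(l, n_states))

backward_marginals = set()
backward_conditionals = set()
for p_c in marginals:
    for p_e_given_c in conditionals:
        joint = p_c[:, None] * p_e_given_c          # P(c, e)
        p_e = joint.sum(axis=0)                     # backward marginal P_E
        p_c_given_e = joint / p_e                   # backward conditional P_{C|E}
        backward_marginals.add(tuple(np.round(p_e, 10)))
        backward_conditionals.add(tuple(np.round(p_c_given_e, 10).ravel()))

n_backward_marginals = len(backward_marginals)
n_backward_conditionals = len(backward_conditionals)
```

For generically chosen classes, both counts should equal `k * l`, so describing the backward direction requires more marginals and more conditionals than the forward direction did.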

Sections 4.1.3 to 4.1.6 describe model assumptions with a priori restriction of 
the model class, while Sections 4.1.7 to 4.1.9 formalize an independence assump- 
tion. Section 4.1.9, however, plays a special role because it should be considered a 
foundational principle rather than an inference method in its own right. 


4.1.3 Linear Models with Non-Gaussian Additive Noise 


While linear structural equations with Gaussian noise have been extensively
studied, it has been observed more recently [Kano and Shimizu, 2003, Shimizu et al.,
2006, Hoyer et al., 2008a] that linear non-Gaussian acyclic models (LiNGAMs)
allow for new approaches to causal inference. In particular, the distinction
between $X$ causes $Y$ and $Y$ causes $X$ from observational data becomes feasible. The
assumption is that the effect $E$ is a linear function of the cause $C$ up to an additive
noise term:

$$E = \alpha C + N_E, \qquad N_E \perp\!\!\!\perp C,$$

with $\alpha \in \mathbb{R}$ (which is a special case of additive noise models introduced in
Section 4.1.4). The following result shows that this assumption is sufficient for
identifying cause and effect.


Theorem 4.2 (Identifiability of linear non-Gaussian models) Assume that $P_{X,Y}$
admits the linear model

$$Y = \alpha X + N_Y, \qquad N_Y \perp\!\!\!\perp X, \tag{4.1}$$

with continuous random variables $X$, $N_Y$, and $Y$. Then there exist $\beta \in \mathbb{R}$ and a
random variable $N_X$ such that

$$X = \beta Y + N_X, \qquad N_X \perp\!\!\!\perp Y, \tag{4.2}$$

if and only if $N_Y$ and $X$ are Gaussian.


Figure 4.1: Joint density over $X$ and $Y$ for an identifiable example. The blue line is the
function corresponding to the forward model $Y := 0.5 \cdot X + N_Y$, with uniformly distributed
$X$ and $N_Y$; the gray area indicates the support of the density of $(X,Y)$. Theorem 4.2 states
that there cannot be any valid backward model since the distribution of $(X, N_Y)$ is
non-Gaussian. The red line characterized by $(b,c)$ is the least square fit minimizing
$\mathbb{E}[X - bY - c]^2$. This is not a valid backward model $X = bY + c + N_X$ since the resulting noise
$N_X$ would not be independent of $Y$ (the size of the support of $N_X$ would differ for different
values of $Y$).


Hence, it is sufficient that $C$ or $N_E$ is non-Gaussian to render the causal direction
identifiable; see Figure 4.1 for an example.
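The asymmetry of Theorem 4.2 can be observed in a small simulation. The sketch below is an illustration under assumptions that are not taken from the text: the uniform distributions are centered on $[-1, 1]$, and dependence between residual and regressor is measured by a crude proxy (the correlation of their squares) rather than by a proper independence test.

```python
import numpy as np

def ols_residual(target, regressor):
    """Residual of the least-squares fit target ~ b * regressor + c."""
    b, c = np.polyfit(regressor, target, deg=1)
    return target - (b * regressor + c)

def dependence_score(residual, regressor):
    """Crude dependence proxy: |corr| between squared residual and squared regressor.

    Least-squares residuals are uncorrelated with the regressor by construction,
    so a higher-order statistic is needed; this choice is an assumption of the sketch.
    """
    return abs(np.corrcoef(residual**2, regressor**2)[0, 1])

rng = np.random.default_rng(0)
n = 20_000
x = rng.uniform(-1, 1, n)
y = 0.5 * x + rng.uniform(-1, 1, n)       # forward model with uniform cause and noise

forward = dependence_score(ols_residual(y, x), x)    # residual of Y given X vs. X
backward = dependence_score(ols_residual(x, y), y)   # residual of X given Y vs. Y

# Gaussian control: here a valid backward model exists.
xg = rng.normal(size=n)
yg = 0.5 * xg + rng.normal(size=n)
backward_gauss = dependence_score(ols_residual(xg, yg), yg)
```

In the uniform case, `forward` should be close to zero while `backward` is clearly positive; in the Gaussian control, the backward score should also be close to zero, in line with the "if and only if" of the theorem.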

Let us look in slightly more detail at how this result is proved. Theorem 4.2
is the bivariate case of the model class LiNGAM introduced by Shimizu et al.
[2006], who prove a multivariate version of Theorem 4.2 using independent
component analysis (ICA) [Comon, 1994, Theorem 11]. The proof of identifiability of ICA is based
on a characterization of the Gaussian distribution that was proved independently
by Skitovič and Darmois [Skitovič, 1954, 1962, Darmois, 1953] and that we now
state.


Theorem 4.3 (Darmois-Skitovič) Let $X_1, \dots, X_d$ be independent, non-degenerate
random variables (see Appendix A.1). If there exist non-vanishing coefficients
$a_1, \dots, a_d$ and $b_1, \dots, b_d$ (that is, for all $i$, $a_i \neq 0 \neq b_i$) such that the two linear
combinations

$$l_1 = a_1 X_1 + \dots + a_d X_d,$$

$$l_2 = b_1 X_1 + \dots + b_d X_d$$

are independent, then each $X_i$ is normally distributed.


It turns out that one can prove the bivariate version stated in Theorem 4.2 as a
short and direct consequence of the theorem of Darmois-Skitovič; for illustration
purposes we attach this proof in Appendix C.1. Furthermore, it can be shown
that the identifiability of bivariate SCMs generalizes to identifiability of
multivariate SCMs [Peters et al., 2011b]. With this result, the multivariate identifiability of
LiNGAM then follows from Theorem 4.2.

Linear models with non-Gaussian additive noise can also be applied to a prob- 
lem that sounds uncommon from the perspective of machine learning but that is in- 
teresting from the perspective of theoretical physics: estimating the arrow of time 
from data. Peters et al. [2009b] show that autoregressive models are time-reversible 
if and only if the noise variables are normally distributed. To explore asymmetries 
of empirical time series, they infer the time direction by fitting two autoregressive 
models, one from the past to the future, as standard, and one from the future to 
the past. In their experiments, the noise variables for the former direction indeed 
tend to be more independent than in the inverted time direction (cf. Section 4.2.1). 
Bauer et al. [2016] extend the idea to multivariate time series. Janzing [2010] links 
this observed asymmetry to the thermodynamic arrow of time, which suggests that 
asymmetries between cause and effect discussed in this book are also related to 
fundamental questions in statistical physics. 


4.1.4 Nonlinear Additive Noise Models 


We now describe additive noise models (ANMs), a less extreme restriction of the 
class of SCMs that is still strong enough to render cause-effect inference feasible. 


Definition 4.4 (ANMs) The joint distribution $P_{X,Y}$ is said to admit an ANM from
$X$ to $Y$ if there is a measurable function $f_Y$ and a noise variable $N_Y$ such that

$$Y = f_Y(X) + N_Y, \qquad N_Y \perp\!\!\!\perp X. \tag{4.3}$$

By overloading terminology, we say that $P_{Y|X}$ admits an ANM if (4.3) holds.


The following theorem shows that “generically,” a distribution does not admit an 
ANM in both directions at the same time. 


Theorem 4.5 (Identifiability of ANMs) For the purpose of this theorem, let us
call the ANM (4.3) smooth if $N_Y$ and $X$ have strictly positive densities $p_{N_Y}$ and $p_X$,
and $f_Y$, $p_{N_Y}$, and $p_X$ are three times differentiable.

Assume that $P_{Y|X}$ admits a smooth ANM from $X$ to $Y$, and there exists a $y \in \mathbb{R}$
such that

$$(\log p_{N_Y})''(y - f_Y(x))\, f_Y'(x) \neq 0 \tag{4.4}$$

for all but countably many values $x$. Then, the set of log densities $\log p_X$ for which
the obtained joint distribution $P_{X,Y}$ admits a smooth ANM from $Y$ to $X$ is contained
in a 3-dimensional affine space.


Proof. (Sketch of the idea. For details, see Hoyer et al. [2009].) The ANM from
$Y$ to $X$, given by

$$p(x,y) = p_Y(y)\, p_{N_X}(x - f_X(y)), \tag{4.5}$$

implies

$$\log p(x,y) = \log p_Y(y) + \log p_{N_X}(x - f_X(y)).$$

One can show that $\log p(x,y)$ then satisfies the following differential equation:

$$\frac{\partial}{\partial x}\left(\frac{\partial^2 \log p(x,y)/\partial x^2}{\partial^2 \log p(x,y)/(\partial x\,\partial y)}\right) = 0. \tag{4.6}$$

On the other hand, the ANM from $X$ to $Y$ reads

$$p(x,y) = p_X(x)\, p_{N_Y}(y - f_Y(x)). \tag{4.7}$$

Taking the logarithm of (4.7) yields

$$\log p(x,y) = \log p_X(x) + \log p_{N_Y}(y - f_Y(x)). \tag{4.8}$$

Applying (4.6) to (4.8) yields a differential equation for the third derivative of
$\log p_X$ in terms of (first, second, and third) derivatives of $f_Y$ and $\log p_{N_Y}$. Thus, $f_Y$
and $p_{N_Y}$ (which are properties of the conditional $P_{Y|X}$) determine $\log p_X$ up to the
three free parameters $\log p_X(\nu)$, $(\log p_X)'(\nu)$, and $(\log p_X)''(\nu)$ for an arbitrary
point $\nu$.


Theorem 4.5 states identifiability in the “generic” case, where “generic” is
characterized by complicated conditions such as (4.4) and the three-dimensional
subspace. For the case where $p_X$ and $p_{N_Y}$ are Gaussian, there is a much simpler
identifiability statement saying that only linear functions $f$ generate distributions that
admit an ANM in backward direction [see Hoyer et al., 2009, Corollary 1].
Figure 4.2 visualizes two “non-generic” examples of bivariate distributions that admit
additive noise models in both directions. First, the obvious case of a bivariate
Gaussian and, second, a sophisticated one that requires fine-tuning between $p_X$ and $p_{N_Y}$
[Mooij et al., 2016].

Figure 4.2: Joint density over $X$ and $Y$ for two non-identifiable examples. The left panel
shows the linear Gaussian case and the right panel shows a slightly more complicated
example, with “fine-tuned” parameters for function, input, and noise distribution (the latter
plot is based on kernel density estimation). The blue function $f_Y$ corresponds to the forward
model $Y := f_Y(X) + N_Y$, and the red function $f_X$ to the backward model $X := f_X(Y) + N_X$.


To relate Theorem 4.5 to causal semantics, assume first that we know a priori that
the joint distribution $P_{X,Y}$ of cause and effect admits an ANM from $C$ to $E$, but we
do not know whether $X = C$ and $Y = E$ or vice versa. Theorem 4.5 then states that
generically there will not be an ANM from $E$ to $C$, and we can thus easily decide
which one of the variables is the cause $C$.
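The decision rule suggested by this paragraph can be sketched as follows. This is a hypothetical illustration, not the estimation procedure discussed later in the book: polynomial least squares stands in for a nonparametric regression, the dependence measure is a crude proxy (correlation of squares) instead of an independence test, and the data-generating model $Y = X^3 + N_Y$ is an assumption chosen for the example.

```python
import numpy as np

def poly_residual(target, regressor, deg=5):
    """Residual after regressing target on a degree-`deg` polynomial of regressor."""
    coeffs = np.polyfit(regressor, target, deg)
    return target - np.polyval(coeffs, regressor)

def dependence_score(residual, regressor):
    """Crude dependence proxy: |corr| between squared residual and squared regressor."""
    return abs(np.corrcoef(residual**2, regressor**2)[0, 1])

rng = np.random.default_rng(1)
n = 20_000
x = rng.uniform(-1, 1, n)
y = x**3 + rng.normal(scale=0.1, size=n)    # hypothetical ANM from X to Y

score_x_to_y = dependence_score(poly_residual(y, x), x)
score_y_to_x = dependence_score(poly_residual(x, y), y)

# Prefer the direction whose residuals look more independent of the regressor.
inferred = "X -> Y" if score_x_to_y < score_y_to_x else "Y -> X"
```

In the forward direction the residuals are essentially the noise $N_Y$ and the score is close to zero. In the backward direction the spread of $X$ given $Y$ changes with $Y$, so the residuals remain dependent on the regressor and the score is clearly larger.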

In general, however, conditionals $P_{E|C}$ in nature are not so strongly restricted that
they necessarily admit an ANM. But is it possible that $P_C$ and $P_{E|C}$ then induce a
joint distribution $P_{C,E}$ that admits an ANM from $E$ to $C$? (In this case, we would
infer the wrong causal direction.) We argue in Section 4.1.9 that this is unlikely if
$P_C$ and $P_{E|C}$ are independently chosen.


4.1.5 Discrete Additive Noise Models 


Additive noise can be defined not solely for real-valued variables, but for any
variable that attains values in a ring. Peters et al. [2010, 2011a] introduce ANMs for
the rings³ $\mathbb{Z}$ and $\mathbb{Z}/m\mathbb{Z}$, that is, the set of integers and the set of integers modulo
$m \in \mathbb{Z}$. In the latter ring, we identify numbers that have the same remainder after
division by $m$. For example, the integers 132 and 4 have the same remainder (namely 4)
after dividing by 8 and we write $132 \equiv 4 \mod 8$. Such a modular arithmetic may
be appropriate when one of the domains inherits a cyclic structure. If we consider
the day of the year, for example, we may want the days December 31 and January
1 to have the same distance as August 25 and August 26.
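The two examples of this paragraph can be written out in a few lines. The function name `cyclic_distance` and the numbering of days from 1 to 365 (a non-leap year) are assumptions made for this illustration.

```python
def cyclic_distance(a, b, m):
    """Distance between a and b on a cycle of length m (arithmetic in Z/mZ)."""
    diff = (a - b) % m
    return min(diff, m - diff)

# 132 and 4 have the same remainder after division by 8.
same_residue = (132 % 8 == 4 % 8)

# Days of a non-leap year numbered 1..365 (an assumption of this sketch).
dec31_to_jan1 = cyclic_distance(365, 1, 365)
aug25_to_aug26 = cyclic_distance(237, 238, 365)
```

With ordinary integer arithmetic, December 31 and January 1 would be 364 days apart; in the cyclic domain both pairs of days have distance 1.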


³ In a ring, we can perform addition and multiplication. The latter operation does not necessarily
have an inverse, though.

As in the continuous case, we can show that in the generic case, a joint
distribution admits an ANM in at most one direction. The following result considers the
example of the ring $\mathbb{Z}$.


Theorem 4.6 (Identifiability of discrete ANMs) Assume that a distribution $P_{X,Y}$
allows for an ANM $Y = f(X) + N_Y$ from $X$ to $Y$ and that either $X$ or $Y$ has finite
support. $P_{X,Y}$ allows for an ANM from $Y$ to $X$ if and only if there exists a disjoint
decomposition $\bigcup_{i=0}^{l} C_i = \operatorname{supp} X$, such that the following conditions a), b), and c)
are satisfied:


a) The $C_i$'s are shifted versions of each other

$$\forall i \; \exists d_i \ge 0 : C_i = C_0 + d_i$$

and $f$ is piecewise constant: $f|_{C_i} \equiv c_i \; \forall i$.


b) The probability distributions on the $C_i$'s are shifted and scaled versions of
each other with the same shift constant as above: For $x \in C_i$, $P(X = x)$
satisfies

$$P(X = x) = P(X = x - d_i) \cdot \frac{P(X \in C_i)}{P(X \in C_0)}.$$


c) The sets $c_i + \operatorname{supp} N_Y := \{c_i + h : P(N_Y = h) > 0\}$ are disjoint sets.


(By symmetry, such a decomposition satisfying the same criteria also exists for
the support of $Y$.) Figure 4.3 shows an example that allows an ANM in both
directions [Peters et al., 2011a].

There are similar results available for discrete ANMs modulo $m$. We refer to
Peters et al. [2011a] for all details; we would like to mention, however, that the
uniform noise distribution plays a special role: $Y = f(X) + N_Y \mod m$ with a
noise variable that is uniformly distributed on $\{0, \dots, m-1\}$ leads to independent
$X$ and $Y$ and therefore allows an ANM from $Y$ to $X$, too.
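The special role of uniform noise can be checked exactly on a small example. The marginal `p_x`, the function `f`, and the modulus `m` below are hypothetical choices; exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction
from itertools import product

m = 5
p_x = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}   # hypothetical P(X)
f = {0: 3, 1: 1, 2: 4}                                            # hypothetical f
p_noise = {h: Fraction(1, m) for h in range(m)}                   # uniform on {0,...,m-1}

# Joint distribution of X and Y = (f(X) + N_Y) mod m with N_Y independent of X.
joint = {}
for (x, px), (h, ph) in product(p_x.items(), p_noise.items()):
    y = (f[x] + h) % m
    joint[(x, y)] = joint.get((x, y), Fraction(0)) + px * ph

p_y = {y: sum(joint.get((x, y), Fraction(0)) for x in p_x) for y in range(m)}

# X and Y are independent iff the joint factorizes into the product of marginals.
independent = all(joint.get((x, y), Fraction(0)) == p_x[x] * p_y[y]
                  for x in p_x for y in range(m))
```

Since adding uniform noise modulo $m$ makes $Y$ uniform for every value of $X$, the joint factorizes, and the trivial backward model with a constant function is an ANM from $Y$ to $X$.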

A discrete ANM imposes strong assumptions on the underlying process that are 
often violated in practice. As in the continuous case, we want to argue that if the 
process allows for a discrete ANM in one direction, it might be reasonable to infer 
that direction as causal (see also Section 4.1.9). 


4.1.6 Post-nonlinear Models 


A more general model class than the one presented in Section 4.1.4 has been an- 
alyzed by Zhang and Hyvärinen [2009]; see also Zhang and Chan [2006] for an 
early reference. 


Figure 4.3: Only carefully chosen parameters allow ANMs in both directions (radii
correspond to probability values); see Theorem 4.6. The sets described by the theorem are
$C_0 = \{a_1, a_2, \dots, a_8\}$ and $C_1 = \{b_1, b_2, \dots, b_8\}$. The function $f$ takes the values $c_0$ and $c_1$
on $C_0$ and $C_1$, respectively.


Definition 4.7 (Post-nonlinear models) The distribution $P_{X,Y}$ is said to admit a
post-nonlinear model if there are functions $f_Y, g_Y$ and a noise variable $N_Y$ such
that

$$Y = g_Y(f_Y(X) + N_Y), \qquad N_Y \perp\!\!\!\perp X. \tag{4.9}$$


The following result essentially shows that a post-nonlinear model exists at most
in one direction except for some “rare” non-generic cases.⁴


Theorem 4.8 (Identifiability of post-nonlinear models) Let $P_{X,Y}$ admit a
post-nonlinear model from $X$ to $Y$ as in (4.9) such that $p_X$, $f_Y$, $g_Y$ are three-times
differentiable. Then it admits a post-nonlinear model from $Y$ to $X$ only if $p_X$, $f_Y$, $g_Y$ are
adjusted to each other in the sense that they satisfy a differential equation described
in Zhang and Hyvärinen [2009].


4.1.7 Information-Geometric Causal Inference 


To provide an idea of how independence between $P_{E|C}$ and $P_C$ can be formalized,
this section describes information-geometric causal inference (IGCI). IGCI, in
particular the simple version described here, is a highly idealized toy scenario that
nicely illustrates how independence in one direction implies dependence in the
other direction [Daniušis et al., 2010, Janzing et al., 2012]. It relies on the
(admittedly strong) assumption of a deterministic relation between $X$ and $Y$ in both
directions; that is,

$$Y = f(X) \quad \text{and} \quad X = f^{-1}(Y).$$

In other words, the noise variable in (3.2) is constant. Then the principle of
independence of cause and mechanism described in Section 4.1.2 reduces to the
independence of $P_X$ and $f$. Remarkably, this independence implies dependence
between $P_Y$ and $f^{-1}$. To show this, we consider the following special case of the
more general setting of Daniušis et al. [2010].


⁴ Here, “rare” should not be mistaken as saying that there are only finitely many exceptions.


Definition 4.9 (IGCI model) Here, $P_{X,Y}$ is said to satisfy an IGCI model from $X$ to
$Y$ if the following conditions hold: $Y = f(X)$ for some diffeomorphism⁵ $f$ of $[0,1]$
that is strictly monotonic and satisfies $f(0) = 0$ and $f(1) = 1$. Moreover, $P_X$ has
a strictly positive continuous density $p_X$, such that the following “independence
condition” holds:

$$\operatorname{cov}[\log f', p_X] = 0, \tag{4.10}$$

where $\log f'$ and $p_X$ are considered as random variables on the probability space
$[0,1]$ endowed with the uniform distribution.⁶


Note that the covariance in (4.10) is explicitly given by

$$\operatorname{cov}[\log f', p_X] = \int_0^1 \log f'(x)\, p_X(x)\,dx - \int_0^1 \log f'(x)\,dx \int_0^1 p_X(x)\,dx$$

$$= \int_0^1 \log f'(x)\, p_X(x)\,dx - \int_0^1 \log f'(x)\,dx.$$


The following result is shown in Daniušis et al. [2010] and Janzing et al. [2012].


Theorem 4.10 (Identifiability of IGCI models) Assume the distribution $P_{X,Y}$
admits an IGCI model from $X$ to $Y$. Then the inverse function $f^{-1}$ satisfies

$$\operatorname{cov}[\log (f^{-1})', p_Y] \ge 0, \tag{4.11}$$

with equality if and only if $f$ is the identity.
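Theorem 4.10 can be checked numerically in a simple special case. The sketch below makes assumptions that are not in the text: $p_X$ is taken to be uniform, so that (4.10) holds trivially for any $f$, and $f(x) = (e^{ax} - 1)/(e^{a} - 1)$ with $a = 2$ is a hypothetical choice of diffeomorphism of $[0,1]$. Expectations under the uniform distribution on $[0,1]$ are approximated by a midpoint rule.

```python
import numpy as np

a = 2.0                                   # hypothetical shape parameter
c = np.exp(a) - 1.0

def f_prime(x):
    """Derivative of f(x) = (exp(a x) - 1) / (exp(a) - 1)."""
    return a * np.exp(a * x) / c

def f_inv_prime(y):
    """Derivative of f^{-1}(y) = log(1 + c y) / a."""
    return c / (a * (1.0 + c * y))

def cov_uniform(g, h):
    """cov[g, h] for functions on [0, 1] under the uniform distribution (midpoint rule)."""
    t = (np.arange(200_000) + 0.5) / 200_000
    gv, hv = g(t), h(t)
    return np.mean(gv * hv) - np.mean(gv) * np.mean(hv)

def p_x(x):
    """Uniform density of the cause, so that (4.10) holds exactly."""
    return np.ones_like(x)

# Change of variables: p_Y(y) = p_X(f^{-1}(y)) (f^{-1})'(y) = (f^{-1})'(y) here.
p_y = f_inv_prime

forward = cov_uniform(lambda x: np.log(f_prime(x)), p_x)        # left-hand side of (4.10)
backward = cov_uniform(lambda y: np.log(f_inv_prime(y)), p_y)   # left-hand side of (4.11)
```

Here `forward` vanishes while `backward` is strictly positive, as the theorem predicts for any $f$ other than the identity. A non-uniform $p_X$ would have to be tuned to satisfy (4.10), which is why the uniform case is used for this illustration.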


⁵ A function is called a diffeomorphism if it is differentiable and bijective and it has a differentiable
inverse.

⁶ This view may be unexpected, but recall that random variables are defined as measurable
functions on a probability space. Here, both $\log f'$ and $p_X$ are functions of $x \in [0,1]$, thus they are random
variables on the common probability space $[0,1]$. Therefore, any distribution on $[0,1]$ defines a joint
distribution of these random variables.

Figure 4.4: Visualization of the idea of IGCI: Peaks of $p_Y$ tend to occur in regions where $f$
has small slope and $f^{-1}$ has large slope (provided that $p_X$ has been chosen independently of
$f$). Thus $p_Y$ contains information about $f^{-1}$. IGCI can be generalized to non-differentiable
functions $f$ [Janzing et al., 2015].


In other words, uncorrelatedness of $\log f'$ and $p_X$ implies positive correlation
between $\log (f^{-1})'$ and $p_Y$ except for the trivial case $f = \mathrm{id}$. This is illustrated in
Figure 4.4. It can be shown [Janzing and Schölkopf, 2015] that uncorrelatedness of $f'$
and $p_X$ (i.e., the analogue of (4.10) without logarithm) implies positive correlation
between $(f^{-1})'$ and $p_Y$, but IGCI uses logarithmic derivatives because this admits
various information-theoretic interpretations [Janzing et al., 2012]. As justification
of (4.10), Janzing et al. [2012] describe a model where $f$ is randomly generated
independently of $P_X$ and show that (4.10) then holds approximately with high
probability. It should be emphasized, however, that such justifications always refer
to oversimplified models that are unlikely to describe realistic situations. Note that
IGCI can easily be extended to bijective relations between vector-valued variables
(as already described by Daniušis et al. [2010, Section 3]), but bijective
deterministic relations are rare for empirical data. Therefore, IGCI only provides a toy
scenario for which cause-effect inference is possible by virtue of an approximate
independence assumption. The assumptions of IGCI have also been used [Janzing
and Schölkopf, 2015] to explain why the performance of semi-supervised learning
depends on the causal direction as stated in Section 5.1. By no means is (4.10)
meant to be the correct formalization of independence of cause and mechanism,
nor do we believe that a unique formalization exists. Sgouritsa et al. [2015], for
instance, propose an “unsupervised inverse regression” technique that tries to predict
$P_{Y|X}$ from $P_X$ and $P_{X|Y}$ from $P_Y$; they then suggest that the direction with the poorer
performance is the causal one. Hence, this approach interprets “independence” as
making this kind of unsupervised prediction impossible.


4.1.8 Trace Method


Janzing et al. [2010] and Zscheischler et al. [2011] describe an IGCI-related
independence between $P_C$ and $P_{E|C}$ for the case where $C$ and $E$ are high-dimensional
variables coupled by a linear SCM:


Definition 4.11 (Trace condition) Let $X$ and $Y$ be variables with values in $\mathbb{R}^d$
and $\mathbb{R}^e$, respectively, satisfying the linear model

$$Y = AX + N_Y, \qquad N_Y \perp\!\!\!\perp X, \tag{4.12}$$

where $A$ is an $e \times d$ matrix of structure coefficients. Then $P_{X,Y}$ is said to satisfy the
trace condition from $X$ to $Y$ if the covariance matrix $\Sigma_{XX}$ and $A$ are “independent”
in the sense that

$$\tau_e(A \Sigma_{XX} A^T) = \tau_d(\Sigma_{XX})\, \tau_e(A A^T), \tag{4.13}$$

where $\tau_k(B) := \operatorname{tr}(B)/k$ denotes the renormalized trace of a matrix $B$.


A simple case that violates the trace condition would be given by a matrix $A$ that
shrinks all eigenvectors of $\Sigma_{XX}$ corresponding to large eigenvalues and stretches
those with small eigenvalues. This would certainly suggest that $A$ has not been
chosen independently of $\Sigma_{XX}$. Roughly speaking, (4.13) describes an
uncorrelatedness between the eigenvalues of $\Sigma_{XX}$ and the factor by which $A$ changes the
length of the corresponding eigenvectors. More formally, (4.13) can be justified by
a generating model with large $d, e$ in which $\Sigma_{XX}$ and $A$ are independently chosen at
random according to an appropriate (rotation invariant) prior probability. Then they
satisfy (4.13) approximately with high probability [Besserve et al., in preparation].

For deterministic invertible relations, the causal direction is identifiable.


Theorem 4.12 (Identifiability via the trace condition) Let both variables $X$ and
$Y$ be $d$-dimensional with $Y = AX$, where $A$ is invertible. If the trace condition
(4.13) from $X$ to $Y$ is fulfilled, then the backward model

$$X = A^{-1} Y$$

satisfies

$$\tau_d(A^{-1} \Sigma_{YY} A^{-T}) \le \tau_d(\Sigma_{YY})\, \tau_d(A^{-1} A^{-T}),$$

with equality if and only if all singular values of $A$ have the same absolute value.

Proof. The proof follows by applying Theorem 2 in Janzing et al. [2010] to the
case $n := m := d$ and observing that $\operatorname{cov}[Z, 1/Z]$ is negative whenever $Z$ is a strictly
positive random variable that is almost surely not constant.

Hence, in the generic case, the trace condition is violated in backward direction
and the violation of the equality always has the same sign.
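The trace condition and its violation in the backward direction can be illustrated with randomly generated matrices. The sketch below rests on assumptions chosen for this example only: the dimension `d`, the eigenvalue and singular-value ranges, and the QR-based construction of random rotations are hypothetical and merely mimic a rotation-invariant generating model.

```python
import numpy as np

def tau(b):
    """Renormalized trace tr(B)/k of a k x k matrix."""
    return np.trace(b) / b.shape[0]

def random_rotation(d, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))      # column sign fix for rotation invariance

rng = np.random.default_rng(0)
d = 200
u = random_rotation(d, rng)
sigma_xx = u @ np.diag(rng.uniform(0.5, 2.0, d)) @ u.T      # covariance of the cause

v, w = random_rotation(d, rng), random_rotation(d, rng)
a = v @ np.diag(rng.uniform(0.5, 2.0, d)) @ w.T             # invertible structure matrix

# Forward direction: ratio of the two sides of (4.13).
forward_ratio = tau(a @ sigma_xx @ a.T) / (tau(sigma_xx) * tau(a @ a.T))

# Backward direction X = A^{-1} Y with Sigma_YY = A Sigma_XX A^T.
a_inv = np.linalg.inv(a)
sigma_yy = a @ sigma_xx @ a.T
backward_ratio = tau(a_inv @ sigma_yy @ a_inv.T) / (tau(sigma_yy) * tau(a_inv @ a_inv.T))
```

For independently drawn $\Sigma_{XX}$ and $A$ the forward ratio is close to 1, while the backward ratio falls clearly below 1 whenever the singular values of $A$ are not all equal, in agreement with Theorem 4.12.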

For noisy relations, no statement like Theorem 4.12 is known. One can still check
whether (4.13) approximately holds in one of the directions and infer this to be the
causal one. Then the structure matrix for the causal model from $Y$ to $X$ is no longer
given by $A^{-1}$. In this case, we introduce the notation $A_Y$ for the model from $X$ to $Y$
and $A_X$ for the model from $Y$ to $X$. What makes the deterministic case particularly
nice is the fact that the quotient

$$\frac{\tau(A_X \Sigma_{YY} A_X^T)}{\tau(A_X A_X^T)\, \tau(\Sigma_{YY})}$$

is known to be smaller than 1 because $A_X = A_Y^{-1}$.

The theoretical justification of independence conditions like (4.10), (4.13), and
others mentioned in this book relies on highly idealized generating models (for
instance, (4.13) has been justified by a model where the covariance matrix of the
cause is generated from a rotation invariant prior [Janzing et al., 2010]). There is
some hope, however, that violations of the idealized assumptions do not necessar- 
ily spoil the causal inference methods. The metaphor with the Beuchet chair may 
help to make this point. First, consider a scenario where the observational vantage 
point is chosen uniformly on a sphere. Clearly, this would contain no information 
about the orientation of the object. In this sense, the uniform prior formalizes an 
“independence” assumption. Then the chair illusion only happens for a negligible 
fraction of angles. It is easy to see that strict uniformity for the choice of the van- 
tage point is not needed to come to this conclusion. Instead, any random choice 
from a prior that is not concentrated within this small fraction of special angles will 
yield the same result. In other words, the conclusion about what a typical subject 
would see is robust with respect to violations of the underlying independence as- 
sumption. For this reason, discussions about the idealized assumptions of causal 
inference should focus on the question to what extent violations spoil the inference 
methods rather than explaining why they are too idealized. 


4.1.9 Algorithmic Information Theory as Possible Foundation 


This section describes an independence principle that relies on a well-defined
mathematical formalism, although it is unclear how to apply it in practice.
It thus plays an intermediate role between the informal philosophical discussion
about foundations of causal inference in Section 2.1 on the one hand and the
concrete results of Sections 4.1.3 to 4.1.8 on possible asymmetries between cause and
effect that rely on rather specific model assumptions on the other hand.

To formalize that $P_C$ and $P_{E|C}$ contain no information about each other for more
general models than the ones considered in Sections 4.1.7 and 4.1.8 is challenging.
It requires a notion of information that refers to objects other than random
variables. This is because $P_C$ and $P_{E|C}$ are not random variables themselves but they
describe distributions of random variables. One interesting notion of information
is given by Kolmogorov complexity, which we briefly explain now.


Notions of Algorithmic Information Theory We first introduce Kolmogorov
complexity: Consider a universal Turing machine $T$, that is, an abstraction of a
computer that is ideal in the sense of having access to infinite memory space. For
any binary string $s$, we define $K_T(s)$ as the length of the shortest program,⁷ denoted
by $s^*$, for which $T$ outputs $s$ and then stops [Solomonoff, 1964, Kolmogorov, 1965,
Chaitin, 1966, Li and Vitányi, 1997]. One may call $s^*$ the shortest compression of $s$,
but keep in mind that $s^*$ contains all the information that $T$ needs for running the
decompression. Hence,

$$K_T(s) := |s^*|,$$


where $|\cdot|$ denotes the number of digits of a binary word. This defines a
probability-free notion of information content with respect to the given Turing machine $T$. In
the following, we will refer to some fixed $T$ and therefore drop the index. Although
$K(s)$ is uncomputable, that is, there is no algorithm that computes $K(s)$ from $s$ [Li
and Vitányi, 1997], it can be useful to formalize conceptual ideas as it is done in
this section.

The conditional algorithmic information of $s$, given $t$, is denoted by $K(s|t)$ and
defined as the length of the shortest program that generates the output $s$ from the
input string $t$ and then stops. One can then define the mutual information as⁸

$$I(s:t) := K(s) - K(s|t^*).$$

In particular, we have [Chaitin, 1966]:

$$I(s:t) \overset{+}{=} K(s) + K(t) - K(s,t), \qquad (4.14)$$

⁷ The program is given by a binary word using prefix-free encoding; that is, no program code is
the prefix of another one. Otherwise one would need an extra symbol indicating the end of the code.

⁸ Note that conditioning on $t^*$ instead of $t$ makes a difference since there is no algorithm that
computes $t^*$ from $t$ (whereas $t$ can be computed from $t^*$); $t^*$ can thus be more valuable as input than $t$. It turns out that
$K(s|t^*)$ shows closer analogies to conditional Shannon entropy than $K(s|t)$.

where the symbol $\overset{+}{=}$ indicates that the equation only holds up to constants; that is,
there is an error term whose length can be bounded independently of the lengths
of $s$ and $t$. To define Kolmogorov complexity $K(s,t)$ for the pair $(s,t)$, one
constructs a simple bijection between strings and pairs of strings by first using some
enumeration of strings and then using a standard bijection between $\mathbb{N}$ and $\mathbb{N} \times \mathbb{N}$.
A simple interpretation of (4.14) is that algorithmic mutual information
quantifies the amount of memory space saved when compressing $s, t$ jointly
instead of compressing them independently. Janzing and Schölkopf [2010] argue
that two objects whose binary descriptions $s, t$ have a significant amount of
mutual information are likely to be causally related. In other words, in the same
way as statistical dependences between random variables indicate causal relations
(see Principle 1.1), algorithmic dependences between objects indicate causal
relations between objects. Observing, for instance, two T-shirts with similar designs
produced by different companies may indicate that one company copied from the
other. Indeed, similarity of patterns in real life may be described by algorithmic
mutual information provided that one has first agreed on an “appropriate” way to
encode the pattern into a binary word and then on an “appropriate” Turing
machine. For the difficult question of what “appropriate” means, see also the brief
discussion of “relative causality” in the introduction of Janzing et al. [2016].
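
The memory-saving reading of (4.14) can be illustrated with an ordinary compressor. The following is only a rough sketch: Kolmogorov complexity is uncomputable, and the zlib-compressed length used here is merely a crude, compressor-dependent upper-bound proxy for it. The helper names `compressed_length` and `approx_mutual_information`, the byte strings standing in for the two "designs", and the encoding of a pair by plain concatenation are all assumptions made for illustration.

```python
import random
import zlib


def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a crude proxy for K)."""
    return len(zlib.compress(data, 9))


def approx_mutual_information(s: bytes, t: bytes) -> int:
    """Compression analogue of (4.14): C(s) + C(t) - C(s, t).

    The pair (s, t) is encoded by simple concatenation, which is an
    illustrative shortcut rather than the bijection described in the text.
    """
    return compressed_length(s) + compressed_length(t) - compressed_length(s + t)


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(2000))

# A "copied design": the original with a handful of bytes altered.
copied = bytearray(original)
for position in rng.sample(range(len(copied)), 20):
    copied[position] ^= 0xFF
copied = bytes(copied)

# An unrelated design of the same length.
unrelated = bytes(rng.getrandbits(8) for _ in range(2000))

info_copy = approx_mutual_information(original, copied)
info_unrelated = approx_mutual_information(original, unrelated)
print(info_copy, info_unrelated)
```

Compressing the original together with its near-copy saves a large number of bytes, whereas compressing it together with an unrelated string saves almost nothing, which mirrors the T-shirt example above.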


Algorithmic Independence of Conditionals The principle of algorithmically
independent conditionals has been stated by Janzing and Schölkopf [2010] and
Lemeire and Janzing [2013] for multivariate causal structures, but it yields
nontrivial implications already for the bivariate case.

For two variables $C$ and $E$ being cause and effect, we assume that $P_C$ and $P_{E|C}$
admit finite descriptions by binary strings $s$ and $t$, respectively. In a parametric
setting, $s$ and $t$ may describe points in the corresponding parameter spaces.
Alternatively, one may think of $s$ and $t$ as being programs that compute $p(c)$ and $p(e|c)$
for all values $c, e$ having finite description length. Then we use $I(P_C : P_{E|C})$ for
$I(s : t)$ and postulate:


Principle 4.13 (Algorithmically independent conditionals) $P_C$ and $P_{E|C}$ are
algorithmically independent, that is,

$$I(P_C : P_{E|C}) \overset{+}{=} 0, \qquad (4.15)$$

or, equivalently,

$$K(P_{C,E}) \overset{+}{=} K(P_C) + K(P_{E|C}). \qquad (4.16)$$
The equivalence of (4.15) and (4.16) is immediate because describing the pair
$(P_C, P_{E|C})$ is equivalent to describing the joint $P_{C,E}$. The idea of Principle 4.13
is that $P_C$ and $P_{E|C}$ are causally unrelated objects of nature. This is certainly an
idealized assumption, but for a setting where $X$ causes $Y$ or $Y$ causes $X$ it suggests
inferring $X \to Y$ whenever the algorithmic dependences between $P_X$ and $P_{Y|X}$ are
weaker than those between $P_Y$ and $P_{X|Y}$. Applying this to empirical data, however, raises the
problem that $P_{X,Y}$ cannot be determined from finite data, on top of the problem that
algorithmic mutual information is uncomputable.

Despite these issues, Principle 4.13 is helpful to justify practical causal
inference methods, as we describe now for the example of ANMs. Janzing and Steudel
[2010] argue that the SCM $Y := f_Y(X) + N_Y$ implies that the second derivative of
$y \mapsto \log p(y)$ is determined by partial derivatives of $(x,y) \mapsto \log p(x|y)$. Hence,
knowing $P_{X|Y}$ admits a short description of $P_Y$ (up to some accuracy). Whenever
$K(P_Y)$ is larger than this small amount of information, Janzing and Steudel [2010]
conclude that $Y \to X$ should be rejected because $P_Y$ and $P_{X|Y}$ are algorithmically
dependent. For any given data set we cannot guarantee that $K(P_Y)$ is large enough
to reject $Y \to X$ just because there is an ANM from $X$ to $Y$. However, when
applying inference that is based on the principle of ANMs to a large set of different
distributions, we know that most of the distributions $P_Y$ are complex enough (since
the set of distributions with low complexity is small) to justify rejecting causal
models that induce ANMs in the opposite direction. Moreover, Figure 5.4, left and
right, shows two simple toy examples where looking at $P_X$ alone suggests a simple
guess for the joint distribution $P_{X,Y}$. Indeed, one can show that this amounts to
algorithmic dependence between $P_X$ and $P_{Y|X}$, as shown for the left case by Janzing
and Schölkopf [2010, remarks after Equation (27)].

We should also point out that (4.15) implies

$$K(P_C) + K(P_{E|C}) \overset{+}{=} K(P_{C,E}) \overset{+}{\le} K(P_E) + K(P_{C|E}). \qquad (4.17)$$

The equality follows because describing $P_{C,E}$ is equivalent to describing the pair
$(P_C, P_{E|C})$, which is not shorter than describing marginal and conditional separately.
The inequality follows because $P_E$ and $P_{C|E}$ also determine $P_{C,E}$. In other words,
independence of conditionals implies that the joint distribution has a shorter
description in the causal direction than in the anticausal direction.⁹


⁹ Checking whether the left-hand side of inequality (4.17) is smaller than the right-hand side is not
the only option to test independence: whenever two strings are algorithmically independent, applying
functions of complexity $O(1)$ to each of them generates again two (possibly simpler) algorithmically
This implication also sounds natural from the perspective of the minimum
description length principle [Grünwald, 2007] and in the spirit of Occam’s razor.


Note, however, that the condition $K(P_C) + K(P_{E|C}) \overset{+}{\le} K(P_E) + K(P_{C|E})$ is strictly
weaker than (4.15) since the shortest description of $P_{C,E}$ may not use either of the
two possible factorizations, which can happen, for instance, when there is a hidden
common cause [Janzing and Schölkopf, 2010, p. 16].

Principle 6.53 generalizes Principle 4.13 to the multivariate setting. 


4.2 Methods for Structure Identification 


We now present different ideas about how the identifiability results obtained in 
Section 4.1 can be exploited for causal discovery. That is, the methods estimate a 
graph from a finite data set. These are challenging statistical problems, which can 
be approached in many different ways. We try to focus on methodological ideas 
and do not claim that the methods we present make the most efficient use of the 
data. It is very well possible that future research will yield novel and successful 
methods. We restrict the attention to a few examples, mainly to those for which we 
have reasonable experience regarding their performance. 


4.2.1 Additive Noise Models 


For causal learning methods based on the identifiability of ANMs according to 
Theorem 4.5, we mainly refer to the multivariate chapter (Section 7.2). Here, we 
sketch two methods without claiming their optimality. The first method tests the 
independence of residuals and is a special case of the regression with subsequent 
independence test (RESIT) algorithm (see Section 7.2). 
1. Regress $Y$ on $X$; that is, use some regression technique to write $Y$ as a
function $\hat f_Y$ of $X$ plus some noise.
2. Test whether $Y - \hat f_Y(X)$ is independent of $X$.
3. Repeat the procedure with the roles of $X$ and $Y$ exchanged.
4. If the independence is accepted for one direction and rejected for the other,
infer the former one as the causal direction.
Figure 4.5 shows the procedure on a simulated data set; see Figure 4.1 for the un- 
derlying distribution. At least in the continuous setting, the first two steps are stan- 


independent strings [Janzing and Schölkopf, 2010, Lemma 6]. This way, one can in principle reject 
algorithmic independence without knowing the complexities of the strings to start with. 
Figure 4.5: We are given a sample from the underlying distribution and perform a linear
regression in the directions $X \to Y$ (left) and $Y \to X$ (right). The fitted functions are shown
in the top row, the corresponding residuals are shown in the bottom row. Only the direction
$X \to Y$ yields independent residuals; see also Figure 4.1.


dard problems of machine learning and statistics (see Appendices A.1 and A.2), 
with the additional challenge that they are coupled: $\hat f_Y$ deviating from $f_Y$ may hide
or create dependences between noise and input variable. In general, any test based
on the estimated residuals may lose its type I error control. As a possible solution
one may use sample splitting [Kpotufe et al., 2014]. Moreover, it is important to
choose an independence test that accounts for higher order statistics rather than
testing correlations only. Any regression technique minimizing quadratic error that
includes linear components and an intercept yields uncorrelated noise.¹⁰ In
practice, one may use the Hilbert-Schmidt Independence Criterion (HSIC) [Gretton
et al., 2008], for example, which we briefly introduce in Appendix A.2. Mooij
et al. [2016, Theorem 20] use a continuity property of HSIC to show that even
without sample splitting, one obtains the correct value of HSIC in the limit of
infinite data (there are no claims about the p-values of the test, however). Finally, the
last step deserves our particular attention because it refers to the relation between
probability and causality. Depending on the significance levels for rejecting and
accepting independence, one may get an ANM in both directions, in no direction,
or in one direction. To enforce a decision, one simply infers as causal the direction
for which the p-value of the independence test is higher.
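
The remark that least-squares residuals are uncorrelated with the input by construction, and that a correlation test therefore cannot detect a misspecified model, can be checked numerically. The sketch below is an illustration under assumptions of our own choosing: the cubic mechanism, the sample size, the seed, and the helper names `mean` and `correlation` are hypothetical, and a deliberately misspecified straight line is fitted.

```python
import random


def mean(values):
    return sum(values) / len(values)


def correlation(a, b):
    """Sample Pearson correlation of two equally long sequences."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
y = [xi ** 3 + rng.gauss(0.0, 1.0) for xi in x]

# Least-squares line y ~ intercept + slope * x (deliberately misspecified).
mean_x, mean_y = mean(x), mean(y)
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x
residuals = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

# Zero up to rounding error, by construction of least squares.
corr_linear = correlation(residuals, x)
# Clearly nonzero: the residuals still depend on x through its cube.
corr_cubic = correlation(residuals, [xi ** 3 for xi in x])
print(corr_linear, corr_cubic)
```

The residuals pass a correlation check against the input although they are plainly dependent on it, which is why a test sensitive to higher-order statistics, such as HSIC, is needed in step 2.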

Recent studies provide some evidence that this procedure yields success rates on
real data above chance level [Mooij et al., 2016]. Figure 4.6 shows the scatter plot
of real-world data¹¹ for which an ANM holds reasonably well only in the causal
direction. For modifications regarding discrete data, we refer to the
corresponding literature [Peters et al., 2011a]. Note that the post-nonlinear model (4.9) is
considerably harder to fit in practice than the more standard nonlinear regression
model (4.3).

As an alternative to the preceding approach, one may also use a maximum
likelihood-based approach. Consider a nonlinear SCM with additive Gaussian
error terms, for example. One may then distinguish between $X \to Y$ and $X \leftarrow Y$
by comparing the likelihood scores of both models. To do so, we first perform a
nonlinear regression of $Y$ on $X$ to obtain residuals $R_Y := Y - \hat f_Y(X)$. We then
compare

$$L_{X \to Y} = -\log \widehat{\operatorname{var}}[X] - \log \widehat{\operatorname{var}}[R_Y] \qquad (4.18)$$

with the analogous version

$$L_{X \leftarrow Y} = -\log \widehat{\operatorname{var}}[R_X] - \log \widehat{\operatorname{var}}[Y] \qquad (4.19)$$

that we obtain when interchanging the roles of $X$ and $Y$. It is not difficult to
show (see Problem 4.16) that this indeed corresponds to a comparison of
likelihoods when instead of performing the regression, we use the true conditional mean


¹⁰ This can easily be seen using the following standard geometric picture: $\operatorname{cov}[\cdot,\cdot]$ defines an inner
product in the space of centred random variables with finite variance. Then the length of the vector
$Y - aX$ is minimal when it is orthogonal to $X$.

¹¹ This is pair001 in the database of cause-effect pairs https://webdav.tuebingen.mpg.de/cause-effect/;
see also [Mooij et al., 2016].
Figure 4.6: Relation between average temperature in degrees Celsius ($Y$) and altitude in
meters ($X$) of places in Germany. The data are taken from “Deutscher Wetterdienst,” see
also Mooij et al. [2016]. A nonlinear function (which is close to linear in the regime far
away from sea level) with additive noise fits these empirical observations reasonably well.


$f_Y(x) = E[Y \mid X = x]$ (and similarly for $f_X$). As before, however, this two-step
procedure of first performing regression and then computing sample variances
requires justification. Bühlmann et al. [2014] use empirical process theory [van de
Geer, 2009] to prove consistency. If the noise does not necessarily follow a
Gaussian distribution, we have to adapt the score functions by replacing the logarithm of
the empirical variance of the residuals with an estimate of the differential entropy
of the error term [Nowzohour and Bühlmann, 2016].


Code Snippet 4.14 The following code shows an example with a finite data set.
It makes use of the code packages dHSIC [Pfister et al., 2017] and mgcv [Wood,
2006]. The former package contains the function dhsic.test, an implementation
of the independence test proposed by Gretton et al. [2008], and the latter package
contains the function gam that we use as a nonlinear regression method in lines
10 and 11 (see Appendix A.1). Only in the backward direction is the independence
between residuals and input rejected; see lines 15 and 17. In lines 21 and 23,
we see that a Gaussian likelihood score favors the forward direction, too; see also
Equations (4.18) and (4.19).

library(dHSIC)
library(mgcv)
#
# generate data set
set.seed(1)
X <- rnorm(200)
Y <- X^3 + rnorm(200)
#
# fit models
modelforw <- gam(Y ~ s(X))
modelbackw <- gam(X ~ s(Y))
#
# independence tests
dhsic.test(modelforw$residuals, X)$p.value
# [1] 0.7628932
dhsic.test(modelbackw$residuals, Y)$p.value
# [1] 0.004221031
#
# computing likelihoods
- log(var(X)) - log(var(modelforw$residuals))
# [1] 0.1420063
- log(var(modelbackw$residuals)) - log(var(Y))
# [1] -1.014013


4.2.2 Information-Geometric Causal Inference 


We sketch the implementation of IGCI briefly and refer to Mooij et al. [2016] for 
details. The theoretical basis is given by the identifiability result in Theorem 4.10 
and some simple conclusions thereof. One can show that the independence
condition (4.10) implies

$$C_{X \to Y} < C_{Y \to X}$$

if one defines

$$C_{X \to Y} := \int_0^1 \log f'(x)\, p(x)\, dx,$$

and $C_{Y \to X}$ similarly. Here, the following straightforward estimators are used:

$$\hat C_{X \to Y} := \frac{1}{N-1} \sum_{j=1}^{N-1} \log \frac{|y_{j+1} - y_j|}{|x_{j+1} - x_j|},$$

where $x_1 < x_2 < \cdots < x_N$ are the observed $x$-values in increasing order. If $Y$ is
an increasing function of $X$, the $y$-values are also ordered, but for real data this will
usually not be the case. The estimator $\hat C_{Y \to X}$ is defined accordingly and $X \to Y$ is
inferred whenever $\hat C_{X \to Y} < \hat C_{Y \to X}$. Apart from this so-called slope-based approach,
there is also an entropy-based approach. One can show that (4.10) also implies

$$H(Y) < H(X),$$

where $H$ denotes the differential Shannon entropy

$$H(X) := -\int_0^1 p(x) \log p(x)\, dx.$$

Intuitively, the reason is that applying a nonlinear function $f$ to $p_X$ generates
additional irregularities (unless the nonlinearity of $f$ is tuned relative to $p_X$) and thus
makes $p_Y$ even less uniform than $p_X$. Accordingly, the variable with the larger
entropy is assumed to be the cause. To estimate $H$, one can use any standard entropy
estimator from the literature.
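
The slope-based rule can be sketched in a few lines. This is a minimal illustration rather than the implementation of Mooij et al. [2016]: the function name `slope_estimator`, the mechanism $f(x) = x^3$ on $[0,1]$, the uniform input, the sample size, and the handling of ties (tied neighbours are skipped, so the average runs over the remaining pairs) are assumptions made here.

```python
import math
import random


def slope_estimator(cause, effect):
    """Average log absolute slope between neighbours after sorting by `cause`."""
    pairs = sorted(zip(cause, effect))
    terms = []
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx != 0 and dy != 0:  # skip ties, where the log-slope is undefined
            terms.append(math.log(abs(dy) / abs(dx)))
    return sum(terms) / len(terms)


rng = random.Random(0)
x = [rng.random() for _ in range(1000)]  # uniform on [0, 1]
y = [xi ** 3 for xi in x]                # hypothetical mechanism f(x) = x^3

c_xy = slope_estimator(x, y)
c_yx = slope_estimator(y, x)
inferred = "X -> Y" if c_xy < c_yx else "Y -> X"
print(c_xy, c_yx, inferred)
```

For this noiseless monotone example the two estimates have opposite signs, and the negative one identifies the direction $X \to Y$. With noisy real data the $y$-values are no longer ordered and the estimates are correspondingly less clean.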


4.2.3 Trace Method 


Recall that this method relies on linear relations between high-dimensional
variables $X$ and $Y$. First assume that the sample size is sufficiently large (compared
to the dimensions of $X$ and $Y$) to estimate the covariance matrices $\Sigma_{XX}$ and $\Sigma_{YY}$
and the structure matrices $A_Y$ and $A_X$ by standard linear regression. To employ
the identifiability result in Theorem 4.12, one can compute the tracial dependency
ratio

$$r_{X \to Y} := \frac{\tau(A_Y \Sigma_{XX} A_Y^T)}{\tau(A_Y A_Y^T)\, \tau(\Sigma_{XX})},$$

and likewise $r_{Y \to X}$ (via swapping the roles of $X$ and $Y$) and infer that the one that
is closer to 1 corresponds to the causal direction [Janzing et al., 2010].
Zscheischler et al. [2011] describe a method to assess whether the deviation
from 1 is significant, subject to a generating model where independence of the two
matrices $A$ and $\Sigma_{XX}$ is simulated by some random orthogonal map rotating them
against each other. Using ideas from free probability theory [Voiculescu, 1997],
a mathematical framework that describes asymptotic behavior of large random
matrices, Zscheischler et al. [2011] construct an implementation of the trace condition
for the regime where the dimension is larger than the sample size. They show that,
in the noiseless case, $r_{X \to Y}$ can still be estimated (although there is not enough data
to estimate $\Sigma_{XX}$ and $A$) subject to an additional independence assumption for $A$
and the empirical covariance matrix of $X$. Therefore, one can reject the hypothesis
$X \to Y$ whenever the estimator deviates significantly from 1. Then, either the
additional independence assumption is wrong or $r_{X \to Y}$ deviates significantly from 1.


4.2.4 Supervised Learning Methods


Finally, we describe a method that approaches causal learning from a more
machine learning point of view. It has, in principle, the ability to make use of either
restricted function classes or an independence condition. Suppose we are given
labeled training data of the form $(D_1, A_1), \ldots, (D_n, A_n)$. Here, each $D_i$ is a data set

$$D_i = \{(X_1, Y_1), \ldots, (X_{n_i}, Y_{n_i})\}$$

containing realizations $(X_1, Y_1), \ldots, (X_{n_i}, Y_{n_i}) \overset{\text{iid}}{\sim} P^i_{X,Y}$, and each label $A_i \in \{\to, \leftarrow\}$
describes whether data set $D_i$ corresponds to $X \to Y$ or $X \leftarrow Y$. Then, causal
learning becomes a classical prediction problem, and one may train classifiers hoping
that they generalize well from the data sets with known ground truth to unseen test
data sets.

To the best of our knowledge, Guyon [2013] was the first to systematically
investigate such an approach in the form of a challenge (providing a mix of
synthetic and real data sets as known ground-truth data). It is clear that the method
will not succeed by exploiting symmetric features such as correlation or covariance.

Many of the competitive classifiers in the challenge were based on hand-crafted
features; examples include entropy estimates of the marginal distributions or
entropy estimates of the distribution of the residuals that resulted from regressing
either $X$ on $Y$ or $Y$ on $X$. Interestingly, such features can be related to the
concept of ANMs. For Gaussian distributed variables, for example, the entropy is a
linear function of the logarithm of the variance and, therefore, the features are
expressive enough to reconstruct the scores (4.18) and (4.19). Considering entropies
instead of logarithms of variances corresponds to relaxing the Gaussianity
assumption [Nowzohour and Bühlmann, 2016].
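
The linear relation between Gaussian entropy and log-variance mentioned above can be written out explicitly. The following is the standard identity for the differential entropy of a Gaussian with the natural logarithm; the symbols $\mu$ and $\sigma^2$ for its mean and variance are notation introduced here for illustration.

```latex
H\big(\mathcal{N}(\mu, \sigma^2)\big)
  = \tfrac{1}{2} \log\big(2 \pi e \, \sigma^2\big)
  = \tfrac{1}{2} \log \sigma^2 + \tfrac{1}{2} \log(2 \pi e)
\quad\Longrightarrow\quad
-\log \sigma^2 = \log(2 \pi e) - 2\, H\big(\mathcal{N}(\mu, \sigma^2)\big).
```

Under Gaussianity, replacing each term of the form $-\log \operatorname{var}[\cdot]$ in (4.18) and (4.19) by minus twice the corresponding entropy therefore shifts both scores by the same additive constant, so the comparison between the two directions is unchanged.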

Lopez-Paz et al. [2015] aim at an automatic construction of such features. The
idea is to map the joint distributions $P^i_{X,Y}$, $i = 1, \ldots, n$, into a reproducing kernel
Hilbert space (see Appendix A.2) and perform a classification in this space. In
practice, one does not have access to the full distribution $P^i_{X,Y}$ and rather uses the
empirical distribution as an approximation. (A similar approach has been used to
distinguish time series that are reversed in time from their original version [Peters
et al., 2009a].) Because the classification into cause and effect seems to rely on
relatively complex properties of the joint distribution, one requires a large sample
size $n$ for the training set. To add useful simulated data sets, these must be
generated from identifiable cases. Lopez-Paz et al. [2015] use additional samples from
ANMs, for example.

Supervised learning methods do not yet work as stand-alone methods for causal
learning. They may prove to be useful, however, as statistical tools that can make
efficient use of known identifiability properties or combinations of those.


4.3 Problems 


Problem 4.15 (ANMs) a) Consider the SCM

$$X := N_X$$
$$Y := 2X + N_Y$$

with $N_X$ uniformly distributed between 1 and 3 and $N_Y$ uniformly distributed
between $-0.5$ and $0.5$ and independent of $N_X$. The distribution $P_{X,Y}$ admits
an ANM from $X$ to $Y$. Draw the support of the joint distribution of $X, Y$ and
convince yourself that $P_{X,Y}$ does not admit an ANM from $Y$ to $X$, that is, there
is no function $g$ and independent noise variables $M_X$ and $M_Y$ such that

$$X = g(Y) + M_X$$
$$Y = M_Y$$

with $M_X$ independent of $M_Y$.

b) Similarly as in part a), consider the SCM

$$X := N_X$$
$$Y := X^2 + N_Y$$

with $N_X$ uniformly distributed between 1 and 3 and $N_Y$ uniformly distributed
between $-0.5$ and $0.5$ and independent of $N_X$. Again, draw the support of
$P_{X,Y}$ and convince yourself that there is no ANM from $Y$ to $X$.


Problem 4.16 (Maximum likelihood) Assume that we are given an i.i.d. data set
$(X_1, Y_1), \ldots, (X_n, Y_n)$ from the model

$$Y = f(X) + N_Y, \quad \text{with } X \sim \mathcal{N}(\mu_X, \sigma_X^2) \text{ and } N_Y \sim \mathcal{N}(\mu_{N_Y}, \sigma_{N_Y}^2) \text{ independent},$$

where the function $f$ is supposed to be known.


a) Prove that $f(x) = E[Y \mid X = x]$.


70 Chapter 4. Learning Cause-Effect Models 


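b) Write $\mathbf{x} := (x_1, \ldots, x_n)$, $\mathbf{y} := (y_1, \ldots, y_n)$ and consider the log-likelihood
function

$$\ell_\theta(\mathbf{x}, \mathbf{y}) = \ell_\theta\big((x_1, y_1), \ldots, (x_n, y_n)\big) = \sum_{i=1}^{n} \log p_\theta(x_i, y_i),$$

where $p_\theta$ is the joint density over $(X, Y)$ and $\theta := (\mu_X, \mu_{N_Y}, \sigma_X^2, \sigma_{N_Y}^2)$. Prove
that for some $c_1, c_2 \in \mathbb{R}$ with $c_2 > 0$

$$\max_\theta \ell_\theta(\mathbf{x}, \mathbf{y}) = c_2 \cdot \big(c_1 - \log \widehat{\operatorname{var}}[\mathbf{x}] - \log \widehat{\operatorname{var}}[\mathbf{y} - f(\mathbf{x})]\big), \qquad (4.20)$$

where $\widehat{\operatorname{var}}[\mathbf{z}] := \frac{1}{n} \sum_{i=1}^{n} \big(z_i - \frac{1}{n} \sum_{k=1}^{n} z_k\big)^2$ estimates the variance.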


Equation (4.20) motivates the comparison of expressions (4.18) and (4.19). The 
main difference is that in this exercise, we have used the conditional mean and not 
the outcome of the regression method. One can show that, asymptotically, the latter 
still produces correct results [Bühlmann et al., 2014]. 


5 


Connections to Machine Learning, I 


As argued in Chapter 1, standard machine learning rests on the same basis as statis- 
tics: we use data sampled i.i.d. from some unknown underlying distribution, and 
seek to infer properties of that distribution. In contrast, causal inference assumes 
a stronger underlying structure, including directed dependences. This makes it 
harder to learn about the structure from data, but it also allows novel statements 
once this is done, including statements about the effect of distribution shifts and 
interventions. If we view machine learning as the process of inferring regularities 
(or “laws of nature”) that go beyond pure statistical associations, then causality 
plays a crucial role. The present chapter presents some thoughts on this, focusing 
on the case of two variables only. Chapter 8 will revisit this topic and look at the 
multivariate case. 


5.1 Semi-Supervised Learning 


Let us consider a regression task, in which our goal is to predict a target variable
$Y$ from a $d$-dimensional predictor variable $X$. For many loss functions, knowing
the conditional distribution $P_{Y|X}$ suffices to solve the problem. For instance, the
regression function

$$f^0(x) = E[Y \mid X = x]$$

minimizes the $L_2$ loss,

$$f^0 \in \operatorname*{argmin}_{f : \mathbb{R}^d \to \mathbb{R}} E\big[(Y - f(X))^2\big].$$
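
Why the conditional mean is a minimizer can be spelled out in one line. The following standard decomposition is a sketch that assumes $Y$ and $f(X)$ have finite second moments (an assumption not stated in the text); it holds for every measurable $f$:

```latex
E\big[(Y - f(X))^2\big]
  = E\big[(Y - E[Y \mid X])^2\big] + E\big[(E[Y \mid X] - f(X))^2\big].
```

The cross term $E\big[(Y - E[Y \mid X])\,(E[Y \mid X] - f(X))\big]$ vanishes by the tower property of conditional expectation. The first summand does not depend on $f$, and the second is zero when $f(x) = E[Y \mid X = x]$.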
In supervised learning, we receive $n$ i.i.d. data points from the joint distribution:
$(X_1, Y_1), \ldots, (X_n, Y_n) \overset{\text{iid}}{\sim} P_{X,Y}$. Regression estimation (with $L_2$ loss) thus amounts
to estimating the conditional mean from $n$ data points of the joint distribution. In
(inductive) semi-supervised learning (SSL), however, we receive $m$ additional
unlabeled data points $X_{n+1}, \ldots, X_{n+m} \overset{\text{iid}}{\sim} P_X$. The hope is that these additional data
points provide information about $P_X$, which itself tells us something about $E[Y \mid X]$
or more generally about $P_{Y|X}$. Many assumptions underlying SSL techniques [see
Chapelle et al., 2006, for an overview] concern relations between $P_X$ and $P_{Y|X}$. The
cluster assumption, for instance, stipulates that points lying in the same cluster
of $P_X$ have the same or a similar $Y$; this is similar to the low-density separation
assumption that states that the decision boundary of a classifier (i.e., points $x$ where
$P(Y = 1 \mid X = x)$ crosses 0.5) should lie in a region where $P_X$ is small. The
semi-supervised smoothness assumption says that the conditional mean $x \mapsto E[Y \mid X = x]$
should be smooth in areas where $P_X$ is large.


5.1.1 SSL and Causal Direction 


In the simplest setting, where the causal graph has only two variables (cause and 
effect), a machine learning problem can either be causal (if we predict effect from 
cause) or anticausal (if we predict cause from effect). Practitioners usually do 
not care about the causal structure underlying a given learning problem (see Fig- 
ure 5.1). However, as we argue herein, the structure has implications for machine 
learning. 

In Section 2.1, we have hypothesized that causal conditionals are independent
of each other (Principle 2.1 and subsequent discussion). Schölkopf et al. [2012]
realize that this principle has a direct implication for SSL. Since the latter relies on
the relation between $P_X$ and $P_{Y|X}$ and the principle claims that $P_{\text{cause}}$ and $P_{\text{effect}|\text{cause}}$
do not contain information about one another, we can conclude that SSL will not
work if $X$ corresponds to the cause and $Y$ corresponds to the effect (i.e., for a
causal learning problem). In this case, additional $x$-values only tell us more about
$P_X$, which is irrelevant because the prediction requires information about the
independent object $P_{Y|X}$. On the other hand, if $X$ is the effect and $Y$ is the cause,
information on $P_X$ may tell us something about $P_{Y|X}$.

A meta-study that analyzed results in SSL supports our hypothesis. All cases 


¹ Again, we use the notation $P_{Y|X}$ as a shorthand for the collection $(P_{Y|X=x})_x$ of conditional
distributions.
Figure 5.1: Top: a complicated mechanism $\varphi$ called the ribosome translates mRNA
information $X$ into a protein chain $Y$. Predicting the protein from the mRNA is an example of
a causal learning problem, where the direction of prediction (green arrow) is aligned with
the direction of causation (red). Bottom: In handwritten digit recognition, we try to infer
the class label $Y$ (i.e., the writer’s intention) from an image $X$ produced by a writer. This
is an anticausal problem.


where SSL helped were anticausal, or confounded, or examples where the causal 
structure was unclear (see Figure 5.2). 

Within the toy scenario of a bijective deterministic causal relation (see
Section 4.1.7), Janzing and Schölkopf [2015] prove that whenever $P_{\text{cause}}$ and $P_{\text{effect}|\text{cause}}$
are independent in the sense of (4.10), then SSL indeed outperforms supervised
learning in the anticausal direction but not in the causal direction. The idea is that
SSL employs the dependence (4.11) for an improved interpolation algorithm.

Sgouritsa et al. [2015] have developed a causal learning method that exploits the
fact that SSL can only work in the anticausal direction.

Finally, note that SSL contains some versions of unsupervised learning as a spe- 
cial case (with no labeled data). In clustering, for example, Y is often a discrete 
value indicating the cluster index. Similarly to the preceding reasoning, we can 
argue that if X is the cause and Y the effect, clustering should not work well. In 


² By user “Boumphreyfr”, https://commons.wikimedia.org/wiki/File:Peptide_syn.png,
[CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0) or GFDL
(http://www.gnu.org/copyleft/fdl.html)]
Figure 5.2: The benefit of SSL depends on the causal structure. Each column of points
corresponds to a benchmark data set from the UCI repository and shows the performance
of six different base classifiers augmented with self-training, a generic method for SSL.
Performance is measured by percentage decrease of error relative to the base classifier,
that is, (error(base) − error(self-train)) / error(base). Self-training overall does not help for
the causal data sets, but it does help for some of the anticausal/confounded data sets [from
Schölkopf et al., 2012].


many applications of clustering on real data, however, the cluster index is rather 
the cause than the effect of the features. 

While the empirical results in Figure 5.2 are promising, the statement that SSL 
does not work in the causal direction (always assuming independence of cause and 
mechanism, cf. Principle 2.1) needs to be made more precise. This will be done 
in the following section; it may be of interest to readers interested in SSL and 
covariate shift, but could be skipped at first reading by others. 


5.1.2 A Remark on SSL in the Causal Direction 


A more precise form of our prediction regarding SSL reads as follows: if the task
is to predict $y$ for some specific $x$, knowledge of $P_X$ does not help when $X \to Y$ is
the causal direction. However, even if $P_X$ does not tell us anything about $P_{Y|X}$ (due
to $X \to Y$), knowing $P_X$ can still help us for better estimating $Y$ in the sense that we
Figure 5.3: In this example, SSL reduces the loss even in the causal direction. Since for
every $x$, the label zero is a priori more likely than the label one, the expected number of
errors is minimized when a function is chosen that attains one at a point $x$ where $p(x)$ is
minimal (here: $x = 3$).


obtain lower risk in a learning scenario. 

To see this, consider a toy example where the relation between $X$ and $Y$ is given
by a deterministic function, that is, $Y = f(X)$, where $f$ is known to be from some
class $F$ of functions. Let $X$ take values in $\{1, \ldots, m\}$ with $m > 3$ and let $Y$ be a
binary label attaining values in $\{0, 1\}$. We define the function class $F := \{f_1, \ldots, f_m\}$
by $f_j(j) = 1$ and $f_j(k) = 0$ for $k \neq j$. In other words, $F$ consists of the set of
functions that attain the value one at exactly one point. Figure 5.3, top, shows the
function $f_3$ for $m = 4$. Suppose that our learning algorithm infers $f_j$ while the
true function is $f_i$. For $i \neq j$, the risk, that is, the expected number of errors (see
Equation (1.2)), equals

$$R_i(f_j) := \sum_{x=1}^{m} |f_j(x) - f_i(x)|\, p(x) = p(j) + p(i), \qquad (5.1)$$

where $p$ denotes the probability mass function for $X$. We now average $R_i(f_j)$ over
the set $F$ and assume that each $f_i$ is equally likely. This yields the expected risk
(where the expectation is taken with respect to a uniform prior over $F$)

$$E[R_i(f_j)] = \frac{1}{m} \sum_{i \neq j} \big(p(j) + p(i)\big) = \frac{m-1}{m}\, p(j) + \frac{1}{m} \sum_{i \neq j} p(i) \qquad (5.2)$$

$$= \frac{m-2}{m}\, p(j) + \frac{1}{m}. \qquad (5.3)$$

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To minimize (5.3), we should thus choose fg such that k minimizes the function p. 
This makes sense because for any point x = 1,...,m, the label y = 0 is more likely 
than y = 1 (probability (m—1)/m versus 1/m). Therefore, we would actually like 
to infer zero everywhere, but since the zero function is not contained in F, we 
are forced to select one x-value to which we assign the label zero. Hence, we 
choose one of the least likely x-values to obtain minimal expected loss (which is 
x = 3 for the distribution in Figure 5.3, bottom). Clearly, unlabeled observations 
help identify the least likely x-values, hence SSL can help. This example does not 
require any (x, y)-pairs (labeled instances); unlabeled data x suffices. It is thus actu- 
ally an example of unsupervised learning rather than being a typical SSL scenario. 
However, accounting for a small number of labeled instances in addition does not 
change the essential idea. Generically, these few instances will not contain any 
instance with y = 1 if mis large enough. Hence, the observed (x,y)-pairs only help 
because they slightly reduce F to a smaller class F’ for which the analysis remains 
basically the same, and we still conclude that the unlabeled instances help. 
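
The averaging argument can be checked numerically. The following R sketch assumes a hypothetical probability mass function p on {1, ..., 4} (the values are chosen for illustration only and are not taken from Figure 5.3). It averages the risk (5.1) over a uniform prior on F and compares the result with the closed form ((m − 2)/m) p(j) + 1/m obtained by carrying out that average.

```r
# hypothetical pmf of X on {1, ..., m}; the values are an assumption for illustration
p <- c(0.3, 0.4, 0.1, 0.2)
m <- length(p)

# risk R_i(f_j) from (5.1): p(j) + p(i) if i != j, and 0 if i == j
risk <- function(i, j) if (i == j) 0 else p[j] + p[i]

# expected risk of each candidate f_j under a uniform prior over the true f_i
expected_risk <- sapply(1:m, function(j) {
  mean(sapply(1:m, function(i) risk(i, j)))
})

# closed form obtained by averaging (5.1) over i
closed_form <- (m - 2) / m * p + 1 / m

# index of the function with minimal expected risk
best_j <- which.min(expected_risk)

print(rbind(expected_risk, closed_form))
print(best_j)
```

For this choice of p, the minimizer is the least likely x-value, which can be identified from unlabeled data alone.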

Although we have not specified a supervised learning scenario as baseline (that
is, one that does not employ knowledge of P_X), we know that it must be worse than
the best semi-supervised scenario because the optimal estimation depends on P_X,
as we have just argued.

Here, the independence of mechanisms is not violated (and thus, X can be
considered as a cause for Y): f is assumed to be chosen uniformly among F, and
knowing P_X does not tell us anything about f. Knowing P_X is only helpful for
minimizing the loss because p(x) appears in (5.2) as a weighting factor.

The preceding example is close in spirit to a Bayesian analysis because it
involved an average over functions in F. It can be modified, however, to apply to
a worst case analysis, in which the true function f is chosen by an adversary to
maximize (5.1) [see also Kääriäinen, 2005]. Given a function f_j, the adversary
chooses f_i with i an x-value different from j with maximal probability mass. The
worst case risk thus reads max_{x ≠ j} {p(x)} + p(j), which is, again, minimized when
j is chosen to be an x-value that minimizes the probability mass function p(x).
Therefore, we conclude that optimal performance is attained only when P_X is taken
into account.
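
As a small numerical illustration of the worst case argument, the R sketch below evaluates max_{x ≠ j} p(x) + p(j) for every candidate j. The probability mass function p is again a hypothetical choice made only for this example.

```r
# hypothetical pmf of X; the values are an assumption for illustration
p <- c(0.3, 0.4, 0.1, 0.2)
m <- length(p)

# worst case risk of f_j: the adversary picks the most likely x-value other than j
worst_case_risk <- sapply(1:m, function(j) max(p[-j]) + p[j])

# candidate with the smallest worst case risk
best_j_worst_case <- which.min(worst_case_risk)

print(worst_case_risk)
print(best_j_worst_case)
```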

Another example can be constructed on the basis of an argument that is given in
a non-causality context by Urner et al. [2011, proof of Theorem 4]. They construct
a case of model misspecification, where the true function f_0 is not contained in the
class F that is optimized over. In their example, additional information about the
marginal P_X helps for reducing the risk, even though the conditional P_{Y|X} can be
considered as being independent of the marginal. Our example above is not based
on the same kind of model misspecification. Each possible (unknown) ground truth
f_j is indeed contained in the class of functions; however, we would like to minimize
the expectation of the risk over a prior, and our function class does not contain a
function that has zero expected risk. Therefore, for the expected risk, this is akin
to a situation of model misspecification.

Finally, we try to give some further intuition about the example by Urner et al.
[2011]. Since f_0 is not contained in the function class F, we need to find a function
f ∈ F that minimizes the distance d(f, f_0), defined as the risk of f, over f ∈ F;
we say f_0 is projected onto F. Roughly speaking, additional information about P_X
provides us with a better understanding of this projection.³


5.2 Covariate Shift 


As explained in Section 2.1, the independence between P_cause and P_{effect|cause}
(Principle 2.1) can be interpreted in two different ways: in Section 5.1 above, we argued
that given a fixed joint distribution, these two objects contain no information about
each other (see the middle box in Figure 2.2). Alternatively, suppose the joint
distribution P_{cause,effect} changes across different data sets; then the change of P_cause
does not tell us anything about the change of P_{effect|cause} (this corresponds to the
left box in Figure 2.2). Knowing that X is the cause and Y the effect thus has
important consequences for a prediction scenario where Y is predicted from X.
Assume we have learned the statistical relation between X and Y using examples
from one data set and we are supposed to employ this knowledge for predicting Y
from X for a second data set. Further, assume that we observe that the x-values
in the second data set follow a distribution P'_X that differs from the distribution P_X
of the first data set. How would we make use of this information? By the
independence of mechanisms, the fact that P'_X differs from P_X does not tell us anything
about whether P_{Y|X} also changed across the data sets. First, it might therefore be the
case that the conditional P_{Y|X} still holds true for the second data set. Second, even
if the conditional did change to P'_{Y|X} ≠ P_{Y|X}, it is natural to still use P_{Y|X} for our
prediction. After all, the independence principle states that the change of the
marginal distribution from P_X to P'_X does not tell us anything about how the
conditional has changed. Therefore, we use P_{Y|X} in the absence of any better candidate.
Using the same conditional P_{Y|X} although P_X has changed is usually referred to as
covariate shift. By now, this is a well-studied assumption in machine learning
[Sugiyama and Kawanabe, 2012]. The argument that this is only justified in the
causal scenario, in other words, if X is the cause and Y the effect, has been made
by Schölkopf et al. [2012].

³ We are grateful to several people who contributed to this discussion: Sebastian Nowozin, Ilya
Tolstikhin, and Ruth Urner.

Figure 5.4: Example where P_X changes to P'_X in a way that suggests that P_Y has changed
and P_{X|Y} remained the same. When Y is binary and known to be the cause of X, observing
that P_X is a mixture of two Gaussians makes it plausible that the two modes correspond to
the two different labels y = 0, 1. Then, the influence of Y on X consists just in shifting the
mean of the Gaussian (which amounts to an ANM; see Section 4.1.4), which is certainly
a simple explanation for the joint distribution. Observing furthermore that the weights of
the mixture changed from one data set to another one makes it likely that this change is
due to the change of P_Y.

To further illustrate this point, consider the following toy example of an
anticausal scenario where X is the effect. Let Y be a binary variable influencing the
real-valued variable X in an additive way:

    X = Y + N_X,    (5.4)

where we assume N_X to be Gaussian noise, independent of Y. Figure 5.4, left,
shows the corresponding probability density p_X.

If the width of the noise is sufficiently small, the distribution P_X is bimodal. Even if
one does not know anything about the generating model, P_X can be recognized as a mixture
of two Gaussian distributions with equal width. In this case, one can therefore
guess the joint distribution P_{X,Y} from P_X alone because it is natural to assume that
the influence of Y consists only in shifting the mean of X. Under this assumption,
we do not need any (x, y)-pairs to learn the relation between X and Y. Assume now
that in a second data set we observe the same mixture of two Gaussian distributions
but with different weights (see Figure 5.4, right). Then, the most natural conclusion
is that the weights have changed because the same equation (5.4) still holds and
only P_Y has changed. Accordingly, we would no longer use the same P_{Y|X} for
our prediction and would instead reconstruct P'_{X,Y} from P'_X. The example illustrates
that in the anticausal scenario the changes of P_X and P_{Y|X} may be related and that this
relation may be due to the fact that P_Y has changed and P_{X|Y} remained the same. In other
words, P_effect and P_{cause|effect} often change in a dependent way because P_cause and
P_{effect|cause} change independently.

Figure 5.5: Example where X causes Y and, as a result, P_Y and P_{X|Y} contain information
about each other. Left: P_X is a mixture of sharp peaks at the positions s_1, s_2, s_3. Right: P_Y is
obtained from P_X by convolution with Gaussian noise with zero mean and thus consists of
less sharp peaks at the same positions s_1, s_2, s_3. Then P_{X|Y} also contains information about
s_1, s_2, s_3 (see Problem 5.1).
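
The following R sketch illustrates the anticausal example under explicit assumptions: the noise width (0.2) and the class probabilities (0.5 in the first data set, 0.9 in the second) are hypothetical values. It shows that P(Y = 1 | X = x) changes with P(Y = 1) although the mechanism (5.4) stays fixed, and that the new class probability can be estimated from unlabeled x-values alone, here via the mean of X, which equals P(Y = 1) because the noise has zero mean.

```r
# posterior P(Y = 1 | X = x) in the model X = Y + N_X with N_X ~ N(0, sd_noise^2)
posterior <- function(x, q, sd_noise = 0.2) {
  num <- q * dnorm(x, mean = 1, sd = sd_noise)
  num / (num + (1 - q) * dnorm(x, mean = 0, sd = sd_noise))
}

# the same x receives different predictions once P(Y = 1) changes from 0.5 to 0.9
post_old <- posterior(0.5, q = 0.5)
post_new <- posterior(0.5, q = 0.9)

# second data set: only the x-values are assumed to be observed
set.seed(1)
n <- 10000
y_new <- rbinom(n, size = 1, prob = 0.9)     # hidden labels
x_new <- y_new + rnorm(n, sd = 0.2)          # observed effect

# E[X] = P(Y = 1) since the noise has zero mean
q_hat <- mean(x_new)

# conditional reconstructed from unlabeled data of the second data set
post_reconstructed <- posterior(0.5, q = q_hat)

print(c(post_old, post_new, q_hat, post_reconstructed))
```

Estimating the weight by the sample mean is only one simple option that relies on the assumed model; fitting a two-component Gaussian mixture would be a more general alternative.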

The previous example describes a specific scenario. Conceiving of general methods
exploiting the fact that P_effect and P_{cause|effect} change in a dependent way is a hard
problem. This may be an interesting avenue for further research, and we believe
that causality could play a major role in domain adaptation and transfer problems;
see also Bareinboim and Pearl [2016], Rojas-Carulla et al. [2016], Zhang et al.
[2013], and Zhang et al. [2015].


5.3 Problems 


Problem 5.1 (Independence of mechanisms) Let P_X be the mixture of k sharp
Gaussian peaks at positions s_1, ..., s_k as shown in Figure 5.5, left. Let Y be
obtained from X by adding some Gaussian noise N with zero mean and a width σ_N
such that the separate peaks remain visible as in Figure 5.5, right.


a) Argue intuitively why P_{X|Y} also contains information about the positions
s_1, ..., s_k of the peaks and thus P_{X|Y} and P_Y share this information.


b) The transition between P_X and P_Y can be described by convolution (from P_X
to P_Y) and deconvolution (from P_Y to P_X). If P_{Y|X} is considered as the linear
map converting the input P_X to the output P_Y, then P_{Y|X} coincides with the
convolution map. Argue why P_{X|Y} does not coincide with the deconvolution
map (as one may think at first glance).


6 


Multivariate Causal Models 


In Chapter 3, we discussed causal models for two variables. While some of the 
basic notions can be more easily explained in the bivariate case, a lot of the struc- 
ture of causal inference derives from multivariate relations, which involve at least 
three variables. We now consider causal models in the more general case of d > 2 
variables. 

Many of the concepts carry over directly and we hope that the reader, equipped 
with the intuition gained in Chapter 3, can easily follow the definitions of SCMs 
(Section 6.2), interventions (Section 6.3), and counterfactuals (Section 6.4). But 
there are fundamental differences to the bivariate case, too. In Section 6.5, we 
will see that the graph structure implies conditional independence statements that 
have been trivial in the bivariate case. Also, computing intervention distributions 
requires more thought in the multivariate setting: We will discuss adjustment for- 
mulas and do-calculus [Pearl, 2009] in Section 6.6. 

We first introduce some graphical terminology. Most of the definitions are self- 
explanatory and can be found in Spirtes et al. [2000], Koller and Friedman [2009], 
and Lauritzen [1996], for example. The reader who is already familiar with graph- 
ical models may want to skip this section. The most important terms for this book 
are directed acyclic graphs (DAGs), v-structures, and d-separation. 


6.1 Graph Terminology 


Consider finitely many random variables X = (X_1, ..., X_d) with index set V :=
{1, ..., d}, joint distribution P_X, and density p(x). A graph G = (V, E) consists
of (finitely many) nodes or vertices V and edges E ⊆ V² with (v, v) ∉ E for any
v ∈ V. We further have the following definitions:

Let G = (V, E) be a graph with V := {1, ..., d} and corresponding random
variables X = (X_1, ..., X_d). A graph G_1 = (V_1, E_1) is called a subgraph of G if V_1 = V
and E_1 ⊆ E; we then write G_1 ≤ G. If additionally, E_1 ≠ E, then G_1 is a proper
subgraph of G.

A node i is called a parent of j if (i, j) ∈ E and (j, i) ∉ E and a child if (j, i) ∈ E
and (i, j) ∉ E. The set of parents of j is denoted by PA_j^G, and the set of its children
by CH_j^G. Two nodes i and j are adjacent if either (i, j) ∈ E or (j, i) ∈ E. We
call G fully connected if all pairs of nodes are adjacent. We say that there is an
undirected edge between two adjacent nodes i and j if (i, j) ∈ E and (j, i) ∈ E. An
edge between two adjacent nodes is directed if it is not undirected. We then write
i → j for (i, j) ∈ E. We call G directed if all its edges are directed.¹ Three nodes
are called an immorality or a v-structure if one node is a child of the two others
that themselves are not adjacent. The skeleton of G does not take the directions
of the edges into account. It is the graph (V, \tilde{E}) with (i, j) ∈ \tilde{E} if (i, j) ∈ E or
(j, i) ∈ E.
A path in G is a sequence of (at least two) distinct vertices i_1, ..., i_m, such that
there is an edge between i_k and i_{k+1} for all k = 1, ..., m − 1. If i_{k−1} → i_k and
i_{k+1} → i_k, i_k is called a collider relative to this path. If i_k → i_{k+1} for all k, we
speak of a directed path from i_1 to i_m and call i_1 an ancestor of i_m and i_m a
descendant of i_1. In this work, all ancestors of i are denoted by AN_i^G and i is not
an ancestor of itself. Furthermore, i is neither a descendant nor a non-descendant
of itself. We denote all descendants of i by DE_i^G and all non-descendants of i,
excluding i, by ND_i^G. In this book, ND_i^G includes the parents of i in graph G. A
node without parents is called a source node, a node without children a sink node.
A permutation π, that is, a bijective function π: {1, ..., d} → {1, ..., d}, is called
a topological or causal ordering if it satisfies π(i) < π(j) if j ∈ DE_i^G (see also
Appendix B).
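
To make the terminology concrete, the R sketch below encodes a DAG as an adjacency matrix and computes parents, children, descendants, and a causal ordering. The encoding A[i, j] = 1 for an edge i → j and all helper names are assumptions made for this illustration. The example graph has the edges 3 → 1, 1 → 2, 2 → 4, and 3 → 4, which matches the parent sets of the SCM described in Code Snippet 6.4.

```r
d <- 4
A <- matrix(0, d, d)                 # A[i, j] = 1 encodes the edge i -> j
A[3, 1] <- 1; A[1, 2] <- 1; A[2, 4] <- 1; A[3, 4] <- 1

parents  <- function(A, j) which(A[, j] == 1)
children <- function(A, i) which(A[i, ] == 1)

# all nodes reachable from i along directed paths (i itself is excluded)
descendants <- function(A, i) {
  found <- integer(0)
  frontier <- children(A, i)
  while (length(frontier) > 0) {
    found <- union(found, frontier)
    next_nodes <- unlist(lapply(frontier, function(k) children(A, k)))
    frontier <- setdiff(next_nodes, found)
  }
  sort(found)
}

# list the nodes in a causal order by repeatedly removing a source node
causal_order <- function(A) {
  remaining <- seq_len(nrow(A))
  ordering <- integer(0)
  while (length(remaining) > 0) {
    sub <- A[remaining, remaining, drop = FALSE]
    sources <- remaining[colSums(sub) == 0]
    if (length(sources) == 0) stop("graph contains a directed cycle")
    ordering <- c(ordering, sources[1])
    remaining <- setdiff(remaining, sources[1])
  }
  ordering
}

node_order <- causal_order(A)        # nodes listed from first to last
pi_order <- match(1:d, node_order)   # pi_order[i] is the position pi(i) of node i

print(parents(A, 4))
print(descendants(A, 3))
print(node_order)
print(pi_order)
```

Here node_order lists the nodes from first to last, whereas pi_order gives the permutation π in the sense of the definition above.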

A graph G is called a partially directed acyclic graph (PDAG) if there is no 
directed cycle, that is, if there is no pair (j, k) with directed paths from j to k and 
from k to j. G is called a directed acyclic graph (DAG) if it is a PDAG and all 
edges are directed. 

Since we will use it at many places herein, we formulate the graphical concept of 
d-separation [Pearl, 1985, 1988] as a definition. 


¹ Note that this excludes cycles of length 2, but it does not exclude longer cycles.

Definition 6.1 (Pearl’s d-separation) In a DAG G, a path between nodes i_1 and i_m
is blocked by a set S (with neither i_1 nor i_m in S) whenever there is a node i_k, such
that one of the following two possibilities holds:

(i) i_k ∈ S and

        i_{k−1} → i_k → i_{k+1}
     or i_{k−1} ← i_k ← i_{k+1}
     or i_{k−1} ← i_k → i_{k+1}

(ii) neither i_k nor any of its descendants is in S and

        i_{k−1} → i_k ← i_{k+1}.

Furthermore, in a DAG G, we say that two disjoint subsets of vertices A and B are
d-separated by a third (also disjoint) subset S if every path between nodes in A
and B is blocked by S. We then write

    A ⫫_G B | S.

The reader may have a look at Figure 6.5 and be convinced that for this DAG, we
have C ⫫_G G | {X} but C ⫫̸_G G | {X, H}.
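
A d-separation statement can also be checked mechanically. The R sketch below does not enumerate paths as in Definition 6.1; instead it uses the equivalent moralization criterion (restrict the DAG to A, B, S and their ancestors, connect parents of a common child, drop edge directions, remove S, and test whether A and B are still connected). The function name, the adjacency-matrix encoding A[i, j] = 1 for i → j, and the example DAG with edges 3 → 1, 1 → 2, 2 → 4, 3 → 4 are assumptions made for this illustration.

```r
d_separated <- function(A, a, b, s = integer(0)) {
  # 1. restrict to a, b, s and all of their ancestors
  keep <- c(a, b, s)
  repeat {
    pa <- which(rowSums(A[, keep, drop = FALSE]) > 0)
    new_nodes <- setdiff(pa, keep)
    if (length(new_nodes) == 0) break
    keep <- c(keep, new_nodes)
  }
  sub <- A[keep, keep, drop = FALSE]

  # 2. moralize: drop directions and connect all parents of a common child
  moral <- sub + t(sub)
  for (j in seq_along(keep)) {
    pa_j <- which(sub[, j] == 1)
    if (length(pa_j) > 1) moral[pa_j, pa_j] <- 1
  }
  diag(moral) <- 0

  # 3. remove s and search for a connection from a to b
  allowed <- setdiff(seq_along(keep), match(s, keep))
  reached <- match(a, keep)
  frontier <- reached
  while (length(frontier) > 0) {
    nb <- which(colSums(moral[frontier, , drop = FALSE]) > 0)
    frontier <- setdiff(intersect(nb, allowed), reached)
    reached <- c(reached, frontier)
  }
  !any(match(b, keep) %in% reached)
}

A <- matrix(0, 4, 4)
A[3, 1] <- 1; A[1, 2] <- 1; A[2, 4] <- 1; A[3, 4] <- 1

sep_given_1    <- d_separated(A, a = 3, b = 2, s = 1)        # collider 4 is not in S
sep_given_1_4  <- d_separated(A, a = 3, b = 2, s = c(1, 4))  # conditioning on collider 4
sep_given_2_3  <- d_separated(A, a = 1, b = 4, s = c(2, 3))

print(c(sep_given_1, sep_given_1_4, sep_given_2_3))
```

In this example, conditioning on node 1 blocks the path 3 → 1 → 2, while additionally conditioning on the collider 4 opens the path 3 → 4 ← 2.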


6.2 Structural Causal Models 


SCMs have been used for a long time in fields such as agriculture, social sciences, 
and econometrics [Wright, 1921, Haavelmo, 1944, Bollen, 1989]; see also Chap- 
ter 2. Model selection, for example, was done by fitting different structures that 
were considered as reasonable given the prior knowledge about the system. These 
candidate structures were then compared using goodness of fit tests. In this chap- 
ter, we introduce the semantics of SCMs and learn how to use them for computing 
intervention distributions, for example. Throughout the whole chapter we will as- 
sume that the SCM or at least its structure is given. We discuss the question of 
identifying the structure in Chapter 7. 


Definition 6.2 (Structural causal models) A structural causal model (SCM)
C := (S, P_N) consists of a collection S of d (structural) assignments

    X_j := f_j(PA_j, N_j),    j = 1, ..., d,    (6.1)

where PA_j ⊆ {X_1, ..., X_d} \ {X_j} are called parents of X_j; and a joint distribution
P_N = P_{N_1, ..., N_d} over the noise variables, which we require to be jointly independent;
that is, P_N is a product distribution.

The graph G of an SCM is obtained by creating one vertex for each X_j and
drawing directed edges from each parent in PA_j to X_j, that is, from each variable X_k
occurring on the right-hand side of equation (6.1) to X_j (see Figure 6.1). We
henceforth assume this graph to be acyclic.

We sometimes call the elements of PA_j not only parents but also direct causes
of X_j, and we call X_j a direct effect of each of its direct causes. SCMs are also
called (nonlinear) SEMs.

[Figure 6.1, left: the SCM X_1 := f_1(X_3, N_1), X_2 := f_2(X_1, N_2), X_3 := f_3(N_3),
X_4 := f_4(X_2, X_3, N_4), with N_1, ..., N_4 jointly independent. Right: the corresponding graph.]

Figure 6.1: Example of an SCM (left) with corresponding graph (right). There is only one
causal ordering π (that satisfies 3 ↦ 1, 1 ↦ 2, 2 ↦ 3, 4 ↦ 4).


Although some of the terminology is causal (“direct cause” and “direct effect”), 
Definition 6.2 is purely mathematical. We discuss its role as a model for a real 
system in Section 6.8. 

SCMs are the key for formalizing causal reasoning and causal learning. We first 
show that an SCM entails an observational distribution. But unlike usual proba- 
bilistic models, they additionally entail intervention distributions (Section 6.3) and 
counterfactuals (Section 6.4); see Figure 6.2. 


Proposition 6.3 (Entailed distributions) An SCM C defines a unique distribution
over the variables X = (X_1, ..., X_d) such that X_j = f_j(PA_j, N_j), in distribution, for
j = 1, ..., d. We refer to it as the entailed distribution P_X^C and sometimes write P_X.


The proof can be found in Appendix C.2. It formalizes the procedure for how
we sample n data points from the joint distribution (“ancestral sampling”): We first
generate an i.i.d. sample N^1, ..., N^n ~ P_N and then subsequently use the structural
assignments (starting from source nodes, then nodes whose parents have all been
generated, and so on, that is, following a causal ordering) to generate i.i.d. data points
X^1, ..., X^n ~ P_X. Structural assignments (6.1)
should be thought of as a set of assignments or functions (rather than a set of
mathematical equations) that tells us how certain variables determine others. This is the
reason why we prefer to avoid the term structural equations, which is commonly
used in the literature.

[Figure 6.2 (diagram): an SCM entails an observational distribution, intervention
distributions, counterfactuals, and a causal graph G.]

Figure 6.2: Causal models as SCMs do not only model an observational distribution P
(Proposition 6.3) but also intervention distributions (Section 6.3) and counterfactuals
(Section 6.4).


Code Snippet 6.4 The following code generates an i.i.d. sample from an SCM
with the form shown in Figure 6.1: structural assignments f_1(x_3, n) = 2x_3 + n,
f_2(x_1, n) = (0.5x_1)^2 + n, f_3(n) = n, and f_4(x_2, x_3, n) = x_2 + 2 sin(x_3 + n), and jointly
independent noise variables with a normal, chi-squared, uniform, and normal
distribution, respectively.


# generate a sample from the distribution entailed by the SCM
set.seed(1)
X3 <- runif(100) - 0.5                 # X3 := N3, uniform noise on (-0.5, 0.5)
X1 <- 2*X3 + rnorm(100)                # X1 := 2*X3 + N1, standard normal noise
X2 <- (0.5*X1)^2 + rnorm(100)^2        # X2 := (0.5*X1)^2 + N2, chi-squared noise (1 df)
X4 <- X2 + 2*sin(X3 + rnorm(100))      # X4 := X2 + 2*sin(X3 + N4), standard normal noise


Remark 6.5 (Linear cyclic assignments) In this book we focus mainly on acyclic
structures. We now briefly discuss linear SCMs with assignments that lead to a
cyclic structure; these are well understood [Lauritzen and Richardson, 2002,
Lacerda et al., 2008, Hyttinen et al., 2012]. We focus on the intuition and do not
provide a formal treatment. More details for the linear case are provided by Hyttinen
et al. [2012], and the nonlinear case is discussed by Mooij et al. [2011] and Bongers
et al. [2016].
Let us denote X = (X_1, ..., X_d) and consider the assignment

    X := BX + N,

with a d × d matrix B that allows for a cyclic structure and some noise vector
N = (N_1, ..., N_d) ~ P_N. Formally, if I − B is invertible, for each value of N, the
preceding equation induces a unique solution for X, namely

    X = (I - B)^{-1} N    (6.2)

(see also Problem 3.8). Equation (6.2) clearly defines a joint distribution over X.
But what is its (causal) interpretation?

One possibility is to interpret it as a result of an equilibration process. Consider
a sequence of random variables X^t that occur as solutions to the iteration

    X^t := B X^{t-1} + N,    t = 1, 2, ....    (6.3)

The sequence X^t converges if B^t → 0 as t → ∞, which is equivalent to all
eigenvalues of B lying strictly inside the unit circle. This is a strictly stronger condition
than the invertibility of I − B (see Problem 6.60). If satisfied, the distribution of the limit is
identical to the distribution induced by Equation (6.2); see Problem 6.61.

In (6.3), we have added the same noise realization in each time step. The limiting
distribution of X^t changes if we instead update the noise in each step:

    X^t := B X^{t-1} + N^{t-1},    t = 1, 2, ...    (6.4)

with N^0, N^1, N^2, ... being i.i.d. copies of N. This can be regarded as a time series
setting and will be discussed in Section 10.2.
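
The equilibration view can be tried out numerically. The R sketch below assumes a hypothetical 2 × 2 coefficient matrix B with a cycle between the two variables (its entries are chosen only so that all eigenvalues lie strictly inside the unit circle). It iterates (6.3) with a fixed noise realization and compares the limit with the direct solution (6.2).

```r
set.seed(1)

# hypothetical cyclic coefficients: X1 depends on X2 and X2 depends on X1
B <- matrix(c( 0.0, 0.5,
              -0.3, 0.0), nrow = 2, byrow = TRUE)

# largest eigenvalue modulus; a value below one guarantees B^t -> 0
spectral_radius <- max(Mod(eigen(B)$values))

N <- rnorm(2)                               # one fixed noise realization

# direct solution (6.2): solve (I - B) X = N
X_direct <- solve(diag(2) - B, N)

# iteration (6.3) with the same noise realization in every step
X_iter <- c(0, 0)
for (t in 1:100) {
  X_iter <- as.vector(B %*% X_iter + N)
}

max_abs_diff <- max(abs(X_iter - X_direct))
print(spectral_radius)
print(max_abs_diff)
```

The starting value and the number of iterations are arbitrary choices; with a spectral radius well below one, the iteration forgets its starting value quickly.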


Proposition 6.3 shows that each SCM entails a distribution. What about the other
direction? Is any distribution entailed by an SCM? Indeed, we will see later
(Proposition 7.1) that each distribution can be induced by an SCM whose graph structure
is a complete DAG (a DAG is called complete if any pair of vertices is adjacent).
This means that the (observational) model class of SCMs, that is, the set of
distributions that can be induced by an SCM, is the set of all distributions.

The definition of SCMs allows for the possibility that a variable appears on the
right-hand side of the structural assignment without affecting the variable on the
left-hand side. Even though such a parent-child relation is in some sense “inactive,”
it still appears as an edge in the corresponding graph. Formally, we exclude this by
the following remark:

Remark 6.6 (Structural minimality of SCMs) Definition 6.2 can be read such
that one distinguishes between the two SCMs

    S_1:  X := N_X,  Y := 0 · X + N_Y    and
    S_2:  X := N_X,  Y := N_Y,

even though clearly 0 · X = 0. This contradicts our intuition. We therefore add the
requirement that the functions f_j depend on all of their input arguments.
Mathematically speaking, whenever there is a k ∈ {1, ..., d} and a function g such that

    f_k(pa_k, n_k) = g(pa_k^*, n_k),    ∀ pa_k, ∀ n_k with p(n_k) > 0,    (6.5)

where PA_k^* ⊊ PA_k, we choose the latter representation. In the preceding example,
we would therefore choose the representation S_2 over S_1. We will see later that
these two SCMs can indeed be identified in that they entail the same observational
distribution, intervention distributions,² and counterfactuals (see Section 6.8).
Furthermore, there is a unique representation in which each function has a
minimal number of inputs. Although this statement seems plausible, we formally prove
it in Appendix C.3. We say that such an SCM satisfies structural minimality.³
From now on, we assume that structural minimality holds. As opposed to
faithfulness (Section 6.5), for example, this is not an assumption about the
underlying world. It is a convention to avoid redundant descriptions.


² We do not allow for interventions that keep the function in the structural assignment fixed and
change only the noise distribution; see (6.5).

³ This term does not coincide with causal minimality (Definition 6.33). Causal minimality implies
structural minimality (Proposition 6.49) but not vice versa; see Problem 6.57.

Remark 6.7 (Relationship to ordinary differential equations) In Remark 6.5,
we have already seen a relation between SCMs and discrete time models, and we
would now like to comment on continuous time models. In physical systems, we
would often expect that causal relationships are governed by sets of coupled
differential equations. A differential equation system dX/dt = f(X) can be represented
approximately as an assignment X_{t+Δt} := X_t + Δt · f(X_t) with small Δt > 0, and
it thus contains information about the causal structure at a fine-grained time scale.
An intervention can be implemented physically as a forcing term pulling a variable
toward a desired value. Under certain stability assumptions, we can study the
effect of interventions in a time-independent manner by analyzing the behavior of the
equilibrium state. This entails an SCM that describes how the equilibrium states
of such a dynamical system will react to physical interventions on the observables
[Mooij et al., 2013]. In the SCM, the variables no longer describe measurements at
specific points in time. On this phenomenological level, the original time structure
disappears. The framework is in principle also applicable to cyclic structures, but
it does not yet address the stochastic case; the theory is restricted to
deterministic relations. This shortcoming is significant, since uncertainty can arise from a
number of sources, including incomplete knowledge of the parameters of the
differential equations or of initial conditions, and, as always, confounding. We
will not discuss further details on deriving phenomenological structural equations
from differential equations and refer to some literature instead [see, e.g., Dash,
2005, Hansen and Sokol, 2014].

Our main motivation for this remark is to avoid a common misconception. It 
is sometimes argued that part of the task of causal inference becomes obsolete 
by specifying the exact time to which a variable refers. This view is particularly 
supported by physics where it is common that every measurement can be uniquely 
assigned to a point in space-time where it has been performed. These arguments 
show, however, that even variables in physics do not always refer to observations 
that are well-defined in time — for example, because they arise from an equilibrium 
scenario. 
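
The relation between a differential equation system, its Euler discretization, and the equilibrium description can be sketched in R. The two-variable system below, the step size, and the way the intervention is modeled (the dynamics of X1 are replaced by a strong forcing term toward the value 3) are all hypothetical choices for illustration; the example is deterministic, in line with the restriction mentioned in the remark.

```r
# iterate the Euler assignment X_{t + dt} := X_t + dt * f(X_t) for many steps
euler_equilibrium <- function(f, x0, dt = 0.01, steps = 5000) {
  x <- x0
  for (s in seq_len(steps)) {
    x <- x + dt * f(x)
  }
  x
}

# hypothetical observational dynamics: X1 relaxes to 1, X2 relaxes to 2 * X1
f_obs <- function(x) c(-(x[1] - 1), -(x[2] - 2 * x[1]))

# hypothetical intervention: a forcing term pulls X1 toward 3,
# while the mechanism for X2 is left untouched
f_do <- function(x) c(-50 * (x[1] - 3), -(x[2] - 2 * x[1]))

eq_obs <- euler_equilibrium(f_obs, x0 = c(0, 0))
eq_do  <- euler_equilibrium(f_do,  x0 = c(0, 0))

print(eq_obs)
print(eq_do)
```

Under these assumptions, the equilibria behave like the deterministic assignments X1 := 1, X2 := 2 · X1, and the intervention on X1 changes the equilibrium of X2 accordingly, although no variable refers to a specific point in time anymore.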


6.3 Interventions 


We are now ready to model interventions in a system. Intuitively, when we
intervene on variable X_2, say, and set it to the binary outcome of a coin flip, we expect
that this intervention changes the distribution of the system compared to its earlier
behavior without intervention. Furthermore, even if the variable X_2 was causally
influenced by other variables before, it is now influenced by nothing other than the
coin flip: its causal parents have changed.

Formally, we construct intervention distributions from an SCM C. They are
obtained by making modifications to C and considering the new entailed distribution.
In general, intervention distributions differ from the observational distribution.


Definition 6.8 (Intervention distribution) Consider an SCM C := (S, P_N) and its
entailed distribution P_X^C. We replace one (or several) of the structural assignments
to obtain a new SCM \tilde{C}. Assume that we replace the assignment for X_k by

    X_k := \tilde{f}(\widetilde{PA}_k, \tilde{N}_k).

We then call the entailed distribution of the new SCM an intervention distribution
and say that the variables whose structural assignment we have replaced have been
intervened on. We denote the new distribution by⁴

    P_X^{\tilde{C}} =: P_X^{C; do(X_k := \tilde{f}(\widetilde{PA}_k, \tilde{N}_k))}.

The set of noise variables in \tilde{C} now contains both some “new” \tilde{N}’s and some “old”
N’s, all of which are required to be jointly independent.

When \tilde{f}(\widetilde{PA}_k, \tilde{N}_k) puts a point mass on a real value a, we simply write
P_X^{C; do(X_k := a)} and call this an atomic intervention.⁵ An intervention with
\widetilde{PA}_k = PA_k, that is, where direct causes remain direct causes, is called imperfect.⁶
This is a special case of a stochastic intervention [Korb et al., 2004], in which the
marginal distribution of the intervened variable has positive variance.

We require that the new SCM \tilde{C} have an acyclic graph; the set of allowed
interventions thus depends on the graph induced by C.


Code Snippet 6.9 The following code samples from an intervention distribution.
We consider the SCM C from Code Snippet 6.4 and perform the intervention
do(X_2 := 3); that is, we generate an i.i.d. sample from the distribution P_X^{C; do(X_2 := 3)}.


# generate a sample from the intervention distribution
set.seed(1)
X3 <- runif(100) - 0.5
X1 <- 2*X3 + rnorm(100)
# old assignment for X2, replaced by the intervention:
# X2 <- (0.5*X1)^2 + rnorm(100)^2
X2 <- rep(3, 100)                      # do(X2 := 3): X2 no longer depends on X1
X4 <- X2 + 2*sin(X3 + rnorm(100))      # unchanged assignment, now fed with X2 = 3


It turns out that the concept of interventions is a powerful tool to model differ- 
ences in distributions and to understand causal relationships. We try to illustrate 
this with some examples. 


⁴ Although the set of parents can change arbitrarily as long as they are not introducing cycles, we
mainly consider interventions for which the new set of parents \widetilde{PA}_k is either empty or equals PA_k.

⁵ This is also referred to as an ideal, structural [Eberhardt and Scheines, 2007], surgical [Pearl,
2009], independent, or deterministic [Korb et al., 2004] intervention.

⁶ This is also referred to as a parametric [Eberhardt and Scheines, 2007] or dependent intervention
[Korb et al., 2004] or simply as a mechanism change [Tian and Pearl, 2001]. For the term soft
intervention, see Eberhardt and Scheines [2007], Eaton and Murphy [2007], and Markowetz et al.
[2005].

Example 6.10 (Predictors and intervention targets) This example considers
prediction. It shows that even though some variables may be good predictors for
a target variable Y, intervening on them may leave the target variable unaffected.
Consider the SCM C

    X_1 := N_{X_1}
    Y   := X_1 + N_Y
    X_2 := Y + N_{X_2}

with graph X_1 → Y → X_2 and with N_{X_1}, N_Y i.i.d. ~ N(0, 1) and N_{X_2} ~ N(0, 0.1)
being jointly independent. Assume
that we are interested in predicting Y from X_1 and X_2. Clearly, X_2 is a better
predictor for Y than X_1 is; for example, a linear model without X_2 leads to a (significantly)
larger mean squared error than a linear model without X_1 would. If we want to
change Y, however, interventions on X_2 are useless:

    P_Y^{C; do(X_2 := \tilde{N})} = P_Y^{C}    for all variables \tilde{N};

in other words, no matter how strongly we intervene on X_2, the distribution of Y
remains unaffected. An intervention on X_1, however, does change the distribution
of Y:

    P_Y^{C; do(X_1 := \tilde{N})} = N(E[N_Y] + E[\tilde{N}], var[N_Y] + var[\tilde{N}]) ≠ P_Y^{C},    if P_{\tilde{N}} ≠ P_{N_{X_1}}.

This example can also be used to show that intervening is usually different from
conditioning:

    p_Y^{C; do(X_2 := x)}(y) = p_Y^{C}(y) ≠ p_Y^{C}(y | X_2 = x).
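
The claims of this example can be checked by simulation. The R sketch below assumes that the second parameter of N(0, 0.1) denotes a variance and uses a hypothetical intervention noise with distribution N(2, 1) for do(X_1 := Ñ); both are assumptions made for the illustration.

```r
set.seed(1)
n <- 10000

# observational sample from the SCM
X1 <- rnorm(n)
Y  <- X1 + rnorm(n)
X2 <- Y + rnorm(n, sd = sqrt(0.1))        # assumption: 0.1 is the noise variance

# prediction: X2 is a much better predictor of Y than X1
mse_without_X2 <- mean(residuals(lm(Y ~ X1))^2)
mse_without_X1 <- mean(residuals(lm(Y ~ X2))^2)

# do(X1 := N~) with hypothetical N~ ~ N(2, 1): the distribution of Y changes
X1_do   <- rnorm(n, mean = 2)
Y_do_X1 <- X1_do + rnorm(n)

# do(X2 := N~): the assignment of Y is untouched, so Y keeps its distribution;
# conditioning on X2, in contrast, changes the distribution of Y
mean_Y            <- mean(Y)
mean_Y_do_X1      <- mean(Y_do_X1)
mean_Y_given_X2   <- mean(Y[X2 > 1])

print(c(mse_without_X2, mse_without_X1))
print(c(mean_Y, mean_Y_do_X1, mean_Y_given_X2))
```

Only summary statistics are compared here; a full comparison of distributions would require a two-sample test or a density estimate.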

Example 6.11 (Myopia) The following case study is one example (out of many)
in which a statistical dependence is mistakenly interpreted as a direct causal
relationship. Humans seem to be particularly susceptible to such a false causal
conclusion when little background knowledge is available. A study established a
dependence between the usage of a night light in a child’s room and the occurrence
of myopia [Quinn et al., 1999, page 113]. While the authors are cautious enough
to say that the study “does not establish a causal link,” they add that “the statistical
strength of the association ... does suggest that the absence of a daily period of
darkness during early childhood is a potential precipitating factor in the
development of myopia.” Based on these findings, a patent was filed [Peterson, 2005]. It
suggests that if we intervene on the variable night light, this changes the probability
of developing myopia.

Subsequently, Gwiazda et al. [2000] and Zadnik et al. [2000] found that the cor- 
relation is due to whether the child’s parents have myopia. They argue that myopic 
parents are more likely to put a night light in their child’s room, and at the same 
time, the child has an increased risk of inheriting the condition. Therefore, assume 
that the underlying (“correct”) SCM is of the form 


    $\mathfrak{C}$:   $PM := N_{PM}$
                      $NL := f(PM, N_{NL})$
                      $CM := g(PM, N_{CM})$


where PM stands for parent myopia, NL for night light, and CM for child myopia.
The corresponding graph is NL ← PM → CM.


In their paper, Quinn et al. [1999] found that $NL \not\!\perp\!\!\!\perp CM$, consistent with the model
(assuming faithfulness; see Definition 6.33). Now we replace the structural
assignment of NL with $NL := \tilde N_{NL}$, where $\tilde N_{NL}$ could randomly assign one out of
the three night light conditions (“darkness,” “night light,” “room light”) with equal
probability. In the corresponding intervention distribution

    $P_{NL,CM}^{\mathfrak{C};do(NL:=\tilde N_{NL})}$,

we would find $NL \perp\!\!\!\perp CM$ since $CM := g(N_{PM}, N_{CM})$. This holds independently of
the distribution of $\tilde N_{NL}$. We say there is no causal effect from NL to CM.
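
The argument of Example 6.11 can be checked by exact enumeration on a toy version of the SCM. The sketch below is only an illustration: the binary encodings, the mechanisms `f` and `g`, and all noise probabilities are assumptions made up for this purpose and are not taken from the cited studies. It computes the joint distribution of (NL, CM) once observationally and once after replacing the assignment of NL by a uniform randomization over the three night light conditions.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical noise pmfs and mechanisms; none of these numbers come from the studies.
P_N_PM = {0: F(7, 10), 1: F(3, 10)}   # N_PM: 1 = myopic parents
P_N_NL = {0: F(4, 5), 1: F(1, 5)}     # N_NL: 1 = deviate from the parents' default
P_N_CM = {0: F(1, 2), 1: F(1, 2)}     # N_CM

def f(pm, n_nl):
    """NL := f(PM, N_NL): myopic parents tend to use a night light."""
    return pm ^ n_nl

def g(pm, n_cm):
    """CM := g(PM, N_CM): in this toy model the child can only inherit myopia."""
    return pm & n_cm

def joint_nl_cm(randomized_nl=None):
    """Exact pmf of (NL, CM). If `randomized_nl` is a pmf, NL is drawn from it instead of f."""
    pmf = {}
    nl_noise = P_N_NL if randomized_nl is None else randomized_nl
    for (pm, a), (n, b), (n_cm, c) in product(
            P_N_PM.items(), nl_noise.items(), P_N_CM.items()):
        nl = f(pm, n) if randomized_nl is None else n
        key = (nl, g(pm, n_cm))
        pmf[key] = pmf.get(key, 0) + a * b * c
    return pmf

def is_independent(pmf):
    """True if the joint pmf of a pair equals the product of its marginals."""
    firsts = {u for u, _ in pmf}
    seconds = {v for _, v in pmf}
    p1 = {u: sum(p for (s, _), p in pmf.items() if s == u) for u in firsts}
    p2 = {v: sum(p for (_, t), p in pmf.items() if t == v) for v in seconds}
    return all(pmf.get((u, v), 0) == p1[u] * p2[v] for u in firsts for v in seconds)

UNIFORM3 = {"darkness": F(1, 3), "night light": F(1, 3), "room light": F(1, 3)}
observational = joint_nl_cm()                         # NL and CM are dependent
interventional = joint_nl_cm(randomized_nl=UNIFORM3)  # NL and CM are independent
```

With these assumed numbers, `is_independent(observational)` is false while `is_independent(interventional)` is true, which mirrors the statement that NL has no causal effect on CM.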


Motivated by the last statement in Example 6.11, we define the existence of a 
total causal effect [cf. Pearl, 2009, “total causal effect’’]. 


Definition 6.12 (Total causal effect) Given an SCM $\mathfrak{C}$, there is a total causal
effect from X to Y if and only if

    $X \not\!\perp\!\!\!\perp Y$   in   $P_{X,Y}^{\mathfrak{C};do(X:=\tilde N_X)}$

for some random variable $\tilde N_X$.

There are concepts other than the one from Definition 6.12 that intuitively
describe the existence of a total causal effect. It turns out, however, that most of the
statements one may have thought about are equivalent. The following proposition
is proved in Appendix C.4.


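Proposition 6.13 (Total causal effects) Given an SCM $\mathfrak{C}$, the following statements
are equivalent:

(i) There is a total causal effect from X to Y.

(ii) There are $x^{\triangle}$ and $x^{\square}$ such that $P_Y^{\mathfrak{C};do(X:=x^{\triangle})} \neq P_Y^{\mathfrak{C};do(X:=x^{\square})}$.

(iii) There is $x^{\triangle}$ such that $P_Y^{\mathfrak{C};do(X:=x^{\triangle})} \neq P_Y^{\mathfrak{C}}$.

(iv) $X \not\!\perp\!\!\!\perp Y$ in $P_{X,Y}^{\mathfrak{C};do(X:=\tilde N_X)}$ for any $\tilde N_X$ whose distribution has full support.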


Not surprisingly, the existence of a total causal effect is related to the existence 
of a directed path in the corresponding graph. The correspondence, however, is 
not one-to-one. While a directed path is necessary for a total causal effect, it is not 
sufficient. 


Proposition 6.14 (Graphical criteria for total causal effects) Assume we are
given an SCM $\mathfrak{C}$ with corresponding graph G.


(i) If there is no directed path from X to Y, then there is no total causal effect. 


(ii) Sometimes there is a directed path but no total causal effect. 
The proof can be found in Appendix C.5. 
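
Statement (ii) of Proposition 6.14 can be illustrated with a small cancellation example. The SCM below is a hypothetical one chosen for this sketch (it is not the example used in the appendix): X := N_X, Y := X + N_Y, Z := Y − X + N_Z, with N_Y and N_Z assumed uniform on {0, 1}. The graph contains the directed paths X → Y → Z and X → Z, yet the two contributions of X to Z cancel. By Proposition 6.13 (ii), comparing interventional distributions for different values of x is enough to decide whether there is a total causal effect.

```python
from fractions import Fraction as F
from itertools import product

NOISE = {0: F(1, 2), 1: F(1, 2)}   # assumed pmf shared by N_Y and N_Z

def pmf_under_do_x(x, target):
    """Exact pmf of `target` ('Y' or 'Z') under do(X := x) in the hypothetical SCM
       X := N_X,  Y := X + N_Y,  Z := Y - X + N_Z."""
    pmf = {}
    for (n_y, p_y), (n_z, p_z) in product(NOISE.items(), NOISE.items()):
        y = x + n_y
        z = y - x + n_z
        value = y if target == "Y" else z
        pmf[value] = pmf.get(value, 0) + p_y * p_z
    return pmf

# X has a total causal effect on Y: the interventional distributions differ.
effect_on_y = pmf_under_do_x(0, "Y") != pmf_under_do_x(5, "Y")
# X has no total causal effect on Z although directed paths X -> Y -> Z and X -> Z exist.
effect_on_z = any(pmf_under_do_x(0, "Z") != pmf_under_do_x(x, "Z") for x in range(-5, 6))
```

Here `effect_on_y` is true and `effect_on_z` is false. The check over x in −5, ..., 5 is only a finite spot check; in this model Z = N_Y + N_Z holds for every x, so the conclusion extends to all values.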


Example 6.15 (Randomized trials) The definition of a causal effect is implemented
in randomized trials. In those studies, one randomly assigns the treatment
T according to $\tilde N_T$ to a patient and, for example, observes the (binary) recovery
variable R. Assume that T takes three possible values (T = 0: no medication,
T = 1: placebo, and T = 2: drug of interest) and that $\tilde N_T$ randomly chooses one
of these three possibilities: $P(\tilde N_T = 0) = P(\tilde N_T = 1) = P(\tilde N_T = 2) = 1/3$. In the
SCM, such a randomization is modeled by observing data from the distribution

    $P_{\mathbf{X}}^{\mathfrak{C};do(T:=\tilde N_T)}$.

(Here, $\mathfrak{C}$ denotes the original SCM without randomization.) If we then still find
a dependence between the treatment and recovery, we conclude that T has a total
causal effect on the recovery.


[Figure 6.3: graph with the paths T → P → R (placebo effect) and T → B → R (biochemical effect).]

Figure 6.3: Simplified description of randomized studies. T denotes the treatment, P and
B the patient’s psychology and some biochemical state, and R indicates whether the patient
recovers. The randomization over T removes the influence of any other variable on T, and
thus there cannot be any hidden common cause between T and R. We distinguish between
two different effects: the placebo effect via P and the biochemical effect via B.


It may turn out, however, that there is a total causal
effect independently of the type of drug. A simplified description can be found in 
Figure 6.3. A patient’s psychology (P) changes, when taking a pill independently 
of its content, which then affects the recovery. Let us assume that this placebo 
effect is the same for the placebo and the drug of interest. That is, the structural 
assignment for P satisfies 


    $f_P(T = 0, N_P) \neq f_P(T = 1, N_P) = f_P(T = 2, N_P)$.


In pharmaceutical studies, we are more interested in the biochemical effect than the
placebo effect. We therefore restrict the randomization to be supported on placebo
and drug of interest, that is, $P(\tilde N_T = 0) = 0$. If we then still see a dependence
between treatment T and recovery R, this must be due to a biochemical effect. 

The idea of using randomized trials for causal learning was described (using 
different mathematical language) by Peirce [1883] and Peirce and Jastrow [1885], 
and later by Neyman [see Splawa-Neyman et al., 1990, for a translated and edited 
version of the original article] and Fisher [1925]. Most of this work dealt with 
applications in agriculture. 

An early example of a randomized trial was performed by James Lind. During 
the eighteenth century, Great Britain lost more soldiers from scurvy than from 
enemy action; vitamin C and its relation to scurvy was still unknown. The Scottish 
physician James Lind (1716-1794) worked as a surgeon on a ship and reports the 
trial as follows [cited after Bhatt, 2010]: 


On the 20th of May 1747, I selected twelve patients in the scurvy, 
on board the Salisbury at sea. Their cases were as similar as I could
have them. They all in general had putrid gums, the spots and lassitude,
with weakness of the knees.... Two were ordered each a quart of
cyder a day. Two others took twenty-five drops of elixir vitriol three 
times a day.... Two others took two spoonfuls of vinegar three times a 
day.... Two of the worst patients were put on a course of sea-water... 
Two others had each two oranges and one lemon given them every 
day.... The two remaining patients, took ... an electary recommended 
by a hospital surgeon.... The consequence was, that the most sudden 
and visible good effects were perceived from the use of oranges and 
lemons; one of those who had taken them, being at the end of six days 
fit for duty. 


The reader will notice that the trial was not fully randomized, but the historical 
curiosity makes up for it. 


Example 6.16 (Kidney stones) Table 6.1 shows a famous data set from kidney
stone recovery [Charig et al., 1986]. Out of 700 patients, one half was treated 
with open surgery (treatment T = a, 78% recovery rate) and the other half with 
percutaneous nephrolithotomy (T = b, 83% recovery rate), a surgical procedure to 
remove kidney stones by a small puncture wound. If we do not know anything 
else than the overall recovery rates, and neglect side effects, for example, many 
people would prefer treatment b if they had to decide. Observing the data in more 
detail, we can categorize kidney stones into small and large stones. We realize 
that the open surgery performs better in both categories. How do we deal with this 
inversion of conclusion? 

We first give an intuitive explanation. Larger stones are more severe than small 
stones (see Table 6.1), and treatment a had to deal with many more of these difficult 
cases (even though the total number of patients assigned to a and b are equal). This 
is why treatment a can look worse than b on the full population but better in both 
subgroups. The imbalance in assignment could, for example, arise if the medical 
doctors expect treatment a to be better than treatment b and therefore assign the 
difficult cases to treatment a with higher probability. 

As an alternative point of view, we propose to use the language of interventions 
to formulate the precise question we are interested in. And this is not whether 
treatment T = a or treatment T = b was more successful in this particular study 
but how the treatments compare when we force all patients to take treatment a 
or treatment b, respectively, or we compare the recovery rates, when each patient 
is assigned randomly to one of the treatments. These three situations concern an 
intervention distribution that is different from the observational distribution $P^{\mathfrak{C}}$.


                                  Overall           Patients with     Patients with
                                                    small stones      large stones

Treatment a:
Open surgery                      78% (273/350)     93% (81/87)*      73% (192/263)*

Treatment b:
Percutaneous nephrolithotomy      83% (289/350)*    87% (234/270)     69% (55/80)


Table 6.1: A classic example of Simpson’s paradox. The table reports the success rates of
two treatments for kidney stones [Bottou et al., 2013, Charig et al., 1986, tables I and II].
Although the overall success rate of treatment b seems better (an asterisk marks the largest
number in each column), treatment b performs worse than treatment a on both patients with small
kidney stones and patients with large kidney stones (see Example 6.37 and Section 9.2).


In particular, they correspond to $P^{\mathfrak{C};do(T:=a)}$, $P^{\mathfrak{C};do(T:=b)}$, or $P^{\mathfrak{C};do(T:=\tilde N_T)}$. We will
compute these intervention distributions in Example 6.37, and we will see why we 
should prefer treatment a over treatment b. This data set is a famous example of 
Simpson’s paradox [Simpson, 1951] (Section 9.2). In fact, it is much less a paradox 
than the result of the influence of confounding, that is, a hidden common cause. 

If you perform a significance test on the data (e.g., using a proportion test or $\chi^2$
independence test), it turns out that the difference in methods is not significant at
the 5% significance level. Note, however, this is not the point of this example. By
multiplying each entry in Table 6.1 by a factor of 10, the results would become 
statistically significant. Also, we concentrate on the recovery R and ignore possible 
side effects that might influence our decision of treatment, too. 
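
The reversal in Table 6.1 is easy to reproduce numerically. The sketch below recomputes the overall recovery rates from the subgroup counts and then evaluates the size-weighted rates $\sum_z P(R = 1 \mid T = t, Z = z) P(Z = z)$. The latter anticipates the computation that Example 6.37 is announced to carry out, and it relies on an assumption that the text only motivates informally here, namely that stone size Z is the only common cause of treatment and recovery. The function names are our own.

```python
from fractions import Fraction as F

# (recovered, treated) counts from Table 6.1, keyed by (treatment, stone size).
COUNTS = {
    ("a", "small"): (81, 87),   ("a", "large"): (192, 263),
    ("b", "small"): (234, 270), ("b", "large"): (55, 80),
}

def overall_rate(t):
    """Recovery rate of treatment t pooled over stone sizes: P(R = 1 | T = t)."""
    recovered = sum(r for (tt, _), (r, _) in COUNTS.items() if tt == t)
    treated = sum(n for (tt, _), (_, n) in COUNTS.items() if tt == t)
    return F(recovered, treated)

def adjusted_rate(t):
    """sum_z P(R = 1 | T = t, Z = z) P(Z = z), assuming size Z is the only confounder."""
    total = sum(n for _, n in COUNTS.values())
    rate = F(0)
    for size in ("small", "large"):
        p_size = F(sum(n for (_, s), (_, n) in COUNTS.items() if s == size), total)
        recovered, treated = COUNTS[(t, size)]
        rate += F(recovered, treated) * p_size
    return rate

pooled = {t: float(overall_rate(t)) for t in ("a", "b")}     # a: 0.78, b: about 0.83
adjusted = {t: float(adjusted_rate(t)) for t in ("a", "b")}  # a: about 0.83, b: about 0.78
```

The pooled rates favor treatment b, whereas the size-weighted rates favor treatment a, in line with the subgroup comparison in the table.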


Intervention variables We now describe an alternative approach to formalize
interventions; see, for example, Dawid [2015] or Pearl [2009, Chapter 3.2.2]. One
augments the SCM $\mathfrak{C}$ and therefore its DAG with parentless nodes $I_1, \ldots, I_d$,
called “intervention variables,” pointing at $X_1, \ldots, X_d$, respectively. For simplicity,
we only discuss interventions on single nodes here. Every $I_j$ attains either the value
idle or one of the possible values $x_j$ that $X_j$ can attain. Then $I_j = x_j$ means that $X_j$
is set to the value $x_j$, while $I_j$ = idle denotes that $X_j$ has not been intervened on.
Accordingly, one replaces the structural assignments

    $X_j := f_j(PA_j, N_j)$

with

    $X_j := \begin{cases} f_j(PA_j, N_j) & \text{if } I_j = \text{idle} \\ I_j & \text{otherwise} \end{cases}$

and adds assignments for $I_1, \ldots, I_d$, all of which are determined only by noise
variables. After assigning non-zero probability (or probability density) to all possible
values of $I_j$, the intervention probabilities entailed by the original SCM $\mathfrak{C}$ turn into
usual conditional probabilities in the augmented SCM $\mathfrak{C}^*$:

    $P_Y^{\mathfrak{C};do(X_j:=x_j)} = P_{Y \mid I_j = x_j}^{\mathfrak{C}^*}$;

see Remark 6.40. Moreover, the statement on whether an intervention on a variable
changes the distribution of a certain target variable turns into a usual statistical
independence statement.
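
A small enumeration can make the construction concrete. The SCM in the following sketch is a hypothetical one (Z := N_Z, X := Z xor N_X, Y := X + Z + N_Y, with assumed noise distributions), and the pmf `P_I` of the intervention variable is an arbitrary strictly positive choice. The sketch compares three distributions of Y: the interventional one under do(X := 1), the conditional one given I = 1 in the augmented SCM, and the observational conditional given X = 1.

```python
from fractions import Fraction as F
from itertools import product

P_NZ = {0: F(1, 2), 1: F(1, 2)}
P_NX = {0: F(3, 4), 1: F(1, 4)}
P_NY = {0: F(1, 2), 1: F(1, 2)}
P_I = {"idle": F(1, 2), 0: F(1, 4), 1: F(1, 4)}   # assumed; any positive pmf works

def run(n_z, n_x, n_y, i):
    """Augmented assignments: X follows its mechanism only if I = idle."""
    z = n_z
    x = (z ^ n_x) if i == "idle" else i
    y = x + z + n_y
    return x, y

def pmf_y(p_i, condition):
    """pmf of Y in the augmented SCM with pmf `p_i` for I, given that `condition(i, x)` holds."""
    pmf, mass = {}, F(0)
    for (n_z, a), (n_x, b), (n_y, c), (i, d) in product(
            P_NZ.items(), P_NX.items(), P_NY.items(), p_i.items()):
        x, y = run(n_z, n_x, n_y, i)
        if condition(i, x):
            pmf[y] = pmf.get(y, 0) + a * b * c * d
            mass += a * b * c * d
    return {y: p / mass for y, p in pmf.items()}

do_x1 = pmf_y({1: F(1)}, lambda i, x: True)          # P_Y under do(X := 1)
aug_x1 = pmf_y(P_I, lambda i, x: i == 1)             # P_{Y | I = 1} in the augmented SCM
obs_x1 = pmf_y({"idle": F(1)}, lambda i, x: x == 1)  # P_{Y | X = 1} in the original SCM
```

In this toy model `do_x1` and `aug_x1` coincide, while `obs_x1` differs from both because Z confounds X and Y.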


6.4 Counterfactuals 


The definition and interpretation of counterfactuals has received a lot of attention 
in the literature. They deal with the following situation: Assume you are playing 
poker and as a starting hand you have ♣J and ♣3 (sometimes called a “lumberjack”:
tree and a jack); you stop playing (“fold”) because you estimate the probability
of winning to be too small and you do not want to lose even more money. Three
more cards are dealt face-up to the board (“flop”). They are ♣4, ♣Q, and ♣2.
The reaction is a typical counterfactual statement: “If I had stayed in the game,
my chances would have been good.” (Five cards of the same suit is the fifth-
highest hand and is called a “flush”; there are even chances for a “straight flush,”
the second-highest hand.) This statement incorporates the observed data (cards
in hand and flop) into the model and then analyzes an intervention distribution 
(stay in the game), in which the rest of the environment remains unchanged (same 
cards). Formally, this corresponds to updating the noise distributions of an SCM 
(by conditioning) and then performing an intervention. 


Definition 6.17 (Counterfactuals) Consider an SCM $\mathfrak{C} := (\mathbf{S}, P_{\mathbf{N}})$ over nodes
$\mathbf{X}$. Given some observations $\mathbf{x}$, we define a counterfactual SCM by replacing the
distribution of noise variables:

    $\mathfrak{C}_{\mathbf{X}=\mathbf{x}} := (\mathbf{S}, P_{\mathbf{N}}^{\mathfrak{C} \mid \mathbf{X}=\mathbf{x}})$,

where $P_{\mathbf{N}}^{\mathfrak{C} \mid \mathbf{X}=\mathbf{x}} := P_{\mathbf{N} \mid \mathbf{X}=\mathbf{x}}$. The new set of noise variables need not be jointly
independent anymore. Counterfactual statements can now be seen as do-statements in
the new counterfactual SCM.^7

This definition can be generalized such that we observe not the full vector $\mathbf{X} = \mathbf{x}$
but only some of the variables.


Example 6.18 (Computing counterfactuals) Consider the following SCM:

    $X := N_X$
    $Y := X^2 + N_Y$
    $Z := 2 \cdot Y + X + N_Z$

with $N_X, N_Y, N_Z \overset{iid}{\sim} U(\{-5, -4, \ldots, 4, 5\})$, that is, uniformly distributed on the
integers between −5 and 5. Now, assume that we observe (X,Y,Z) = (1,2,4). Then
$P_{\mathbf{N}}^{\mathfrak{C} \mid \mathbf{X}=\mathbf{x}}$ puts a point mass on $(N_X, N_Y, N_Z) = (1, 1, -1)$ because here all noise terms
can be uniquely reconstructed from the observations. We therefore have the
counterfactual statement (in the context of (X,Y,Z) = (1,2,4)): “Z would have been
11 had X been 2.” In this book, such a sentence is interpreted as: “Z would have
been 11 had X been set to 2.” Mathematically, this means that $P_Z^{\mathfrak{C} \mid \mathbf{X}=\mathbf{x};\,do(X:=2)}$ has
a point mass on 11. In the same way, we obtain “Y would have been 5, had X been
2,” and “Z would have been 10, had Y been 5.”


Since the construction of counterfactuals involves several steps, its notation looks
quite complicated.^8 We hope that the following image provides further clarification.

    [Figure: the expression $P_Z^{\mathfrak{C} \mid \mathbf{X}=\mathbf{x};\,do(Y:=2)}$ with its four components annotated]
    1. the SCM $\mathfrak{C}$ we start with
    2. the observed data $\mathbf{X} = \mathbf{x}$
    3. the intervention do(Y := 2)
    4. the variable Z we are interested in


^7 In the continuous case, this definition comes with measure theoretic problems since usually the
conditional distribution is only defined up to null sets. To make our life easier, we restrict
counterfactuals to the discrete case, that is, when the noise distribution has a probability mass function.
In the case of continuous variables with density, we condition not on $\mathbf{X} = \mathbf{x}$ but on $\mathbf{X} \in A$ with
$P(\mathbf{X} \in A) > 0$ instead.

^8 Pearl [2009] uses the somewhat simpler notation $Z_y(u)$, where the subscript y denotes the
intervention do(Y := y) and u represents the additional information about the error terms, which he
calls u, that may be implied by $\mathbf{X} = \mathbf{x}$, for example.
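
The computation in Example 6.18 follows the two steps named before Definition 6.17: first condition the noise on the observation, then intervene. Because all noise values can be reconstructed uniquely here, both steps reduce to plain arithmetic. The following sketch hard-codes the three assignments of Example 6.18; the function names `abduct` and `predict` are our own labels and not terminology from the text.

```python
def abduct(x, y, z):
    """Recover (N_X, N_Y, N_Z) from a full observation of (X, Y, Z)."""
    n_x = x
    n_y = y - x ** 2
    n_z = z - 2 * y - x
    return n_x, n_y, n_z

def predict(noise, do=None):
    """Re-evaluate the assignments with the noise held fixed; `do` overrides assignments."""
    do = do or {}
    n_x, n_y, n_z = noise
    x = do.get("X", n_x)
    y = do.get("Y", x ** 2 + n_y)
    z = do.get("Z", 2 * y + x + n_z)
    return {"X": x, "Y": y, "Z": z}

noise = abduct(1, 2, 4)                    # (1, 1, -1)
had_x_been_2 = predict(noise, {"X": 2})    # {'X': 2, 'Y': 5, 'Z': 11}
had_y_been_5 = predict(noise, {"Y": 5})    # {'X': 1, 'Y': 5, 'Z': 10}
```

The two results reproduce the statements “Z would have been 11 had X been 2” and “Z would have been 10, had Y been 5.” With partial observations or non-invertible assignments, the first step would return a distribution over noise values rather than a single triple.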
Counterfactual statements depend strongly on the structure of the SCM.
Example 6.19 shows two SCMs that induce the same graph, observational distributions,
and intervention distributions but entail different counterfactual statements. Later,
we will call those SCMs “probabilistically and interventionally equivalent” but not
“counterfactually equivalent” (see Definition 6.47).


Example 6.19 Let $N_1, N_2 \sim Ber(0.5)$, and $N_3 \sim U(\{0,1,2\})$, such that the three
variables are jointly independent. That is, $N_1, N_2$ have a Bernoulli distribution with
parameter 0.5 and $N_3$ is uniformly distributed on {0,1,2}. We define two different
SCMs. First consider $\mathfrak{C}_A$:

    $X_1 := N_1$
    $X_2 := N_2$
    $X_3 := (1_{N_3 > 0} \cdot X_1 + 1_{N_3 = 0} \cdot X_2) \cdot 1_{X_1 \neq X_2} + N_3 \cdot 1_{X_1 = X_2}$.

If $X_1$ and $X_2$ have different values, depending on $N_3$ we either choose $X_3 = X_1$ or
$X_3 = X_2$. Otherwise $X_3 = N_3$. Now, $\mathfrak{C}_B$ differs from $\mathfrak{C}_A$ only in the latter case:

    $X_1 := N_1$
    $X_2 := N_2$
    $X_3 := (1_{N_3 > 0} \cdot X_1 + 1_{N_3 = 0} \cdot X_2) \cdot 1_{X_1 \neq X_2} + (2 - N_3) \cdot 1_{X_1 = X_2}$.

Both SCMs entail the same observational distribution; and for any possible
intervention they entail the same intervention distributions, too.^9 But the two
models differ in a counterfactual statement. Suppose we have made an observation
$(X_1, X_2, X_3) = (1, 0, 0)$ and we are interested in the counterfactual question “what
would $X_3$ have been if $X_1$ had been 0?” From both SCMs, it follows that $N_3 = 0$,
and thus the two SCMs $\mathfrak{C}_A$ and $\mathfrak{C}_B$ “predict” different values for $X_3$ under a
counterfactual change of $X_1$ (namely 0 and 2, respectively).
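
The claims of Example 6.19 can be verified by brute force, since there are only 12 equally likely noise configurations. The sketch below encodes the two assignments for $X_3$, compares the entailed observational distributions and the distributions under atomic interventions on $X_1$ and $X_2$ (a subset of all possible interventions, chosen here for brevity), and then evaluates the counterfactual question from the example. All function names are our own.

```python
from fractions import Fraction as F
from itertools import product

NOISE = list(product((0, 1), (0, 1), (0, 1, 2)))   # (n1, n2, n3), each with mass 1/12

def x3_A(x1, x2, n3):
    return (x1 if n3 > 0 else x2) if x1 != x2 else n3

def x3_B(x1, x2, n3):
    return (x1 if n3 > 0 else x2) if x1 != x2 else 2 - n3

def joint(x3_fn, do=None):
    """Exact pmf of (X1, X2, X3); `do` maps an index in {1, 2} to a fixed value."""
    do = do or {}
    pmf = {}
    for n1, n2, n3 in NOISE:
        x1, x2 = do.get(1, n1), do.get(2, n2)
        key = (x1, x2, x3_fn(x1, x2, n3))
        pmf[key] = pmf.get(key, 0) + F(1, 12)
    return pmf

def counterfactual_x3(x3_fn, observation, do):
    """Set of values of X3 under `do` over all noise settings consistent with `observation`."""
    values = set()
    for n1, n2, n3 in NOISE:
        if (n1, n2, x3_fn(n1, n2, n3)) == observation:
            x1, x2 = do.get(1, n1), do.get(2, n2)
            values.add(x3_fn(x1, x2, n3))
    return values

interventions = [{}, {1: 0}, {1: 1}, {2: 0}, {2: 1}, {1: 0, 2: 0}, {1: 1, 2: 0}]
same_distributions = all(joint(x3_A, d) == joint(x3_B, d) for d in interventions)
cf_A = counterfactual_x3(x3_A, (1, 0, 0), {1: 0})   # {0}
cf_B = counterfactual_x3(x3_B, (1, 0, 0), {1: 0})   # {2}
```

The distributions agree for every listed intervention, yet the counterfactual value of $X_3$ is 0 under the first SCM and 2 under the second.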


The implications from the preceding example are twofold: (1) Both SCMs correspond
to the same causal graphical model (see Section 6.5.2), and in this sense,
causal graphical models are not rich enough to predict counterfactuals. (2) In
Section 6.8, we relate intervention distributions to real-world randomized experiments.
For this example, we cannot use randomized trials or observational data to distinguish
between $\mathfrak{C}_A$ and $\mathfrak{C}_B$. Thus, if we are interested in counterfactual statements,
we require additional assumptions that let us distinguish between $\mathfrak{C}_A$ and $\mathfrak{C}_B$.


^9 In this example, the observational distribution satisfies causal minimality with respect to the
underlying graph (here $X_1 \to X_3 \leftarrow X_2$); see Definition 6.33. Another example can be found in
Section 3.4; it is less complex but violates causal minimality.

We now summarize some properties of counterfactuals. 


Remark 6.20 (i) Counterfactual statements are not transitive. In Example 6.18
we found that given the observation (X,Y,Z) = (1,2,4),

    “Y would have been 5, had X been 2,”
    “Z would have been 10, had Y been 5,” and
    “Z would have not been 10, had X been 2.”

Therefore, we cannot simply introduce new variables X and Y, say, and
interpret the statement “Y would have been 5, had X been 2” as a logical
implication of the form “X = 2 ⇒ Y = 5.” In the preceding example, the
non-transitivity is due to the direct link from X to Z, that is, the existence of
a path from X to Z that does not pass Y. A similar counterexample holds for
intervention distributions.


(ii) Humans often think in counterfactuals: “I should have taken the train.”, “Do
you remember our flight to New York on September 11, 2000? Imagine if
we would have taken the flight one year later!” or “We should have invested
in CHF in December 2014!” are only a few examples. Interestingly, this
sometimes even concerns situations in which we made optimal decisions,
based on the available information. Assume someone offers you $10,000 if
you predict the result of a coin flip; you guess “heads” and lose. Some people
may then think, “Why did I not say ‘tails’?” even though there was no way
they could have possibly known the outcome. Roese [1997], Byrne [2007],
and others provide the psychological implications of counterfactual thinking.
Discussing whether counterfactual statements contain any information that
can help us make better decisions in the future is interesting but lies beyond
this work; see also Pearl [2009, Chapter 4].


(iii) We do not discuss the role of counterfactuals in our legal system either; it is
an interesting question whether and how counterfactuals should be taken as
a basis of verdicts (see Example 3.4).


(iv) People have been thinking about counterfactuals for a long time; it is a
popular tool of historians. Titus Livius, for example, discusses in 25 BC what
would have happened if Alexander the Great had not died in Asia and had
attacked Rome [Geradin and Girgenson, 2011]. Paul’s First Epistle to the
Corinthians (7:29-7:31) states: “But I say this, brothers: the time is short,
that from now on, both those who have wives may be as though they had
none; / and those who weep, as though they didn’t weep; and those who
rejoice, as though they didn’t rejoice; and those who buy, as though they didn’t
possess; / and those who use the world, as not using it to the fullest.”


(v) We can think of interventional statements as a mathematical construct for 
(randomized) experiments. For counterfactual statements, there is no compa- 
rable correspondence in the real world. One may speculate that many coun- 
terfactual statements cannot be falsified and should therefore not be used 
in scientific inquiry [cf. Popper, 2002]. Note, however, that sometimes we 
can make falsifiable counterfactual statements (for example, when the actual 
value of the noise terms for the respective instance in the sample becomes 
apparent in retrospect; see Example 3.4). Moreover, the counterfactuals we 
described above are consequences of positing an SCM. Another target of fal- 
sification can therefore also be the SCM rather than a given counterfactual 
statement. This may or may not be possible, for example, using methods 
from a scientific domain that the SCM refers to.^10


These remarks can be considered as food for thought. We do not go into further 
depth regarding the interpretation of counterfactual statements and how they should 
or can be used in court cases, for example. Many of these deliberations lie outside 
our field of expertise. Instead, we refer to Halpern [2016] who discusses what it 
means that some event was an “actual cause” of some other event. 


6.5 Markov Property, Faithfulness, and Causal 
Minimality 


6.5.1 Markov Property 


The Markov property is a commonly used assumption that forms the basis of
graphical models. When a distribution is Markovian with respect to a graph, this
graph encodes certain independences in the distribution that we can exploit for
efficient computation or data storage. The Markov property exists for both directed
and undirected graphs, and the two classes encode different sets of independences
[Koller and Friedman, 2009]. In causal inference, however, we are mainly
interested in directed graphs. Many introductions to causal inference start by
postulating the Markov property. Instead, in this book, we assume the existence of an
underlying SCM. We will see in Proposition 6.31 that this is sufficient for proving
the Markov property. But first, let us define it.


^10 Note that the freedom of reparametrization, as described in Section 3.4, always remains.


Definition 6.21 (Markov property) Given a DAG G and a joint distribution $P_{\mathbf{X}}$,
this distribution is said to satisfy

(i) the global Markov property with respect to the DAG G if

    $A \perp\!\!\!\perp_G B \mid C \ \Rightarrow\ A \perp\!\!\!\perp B \mid C$

for all disjoint vertex sets A, B, C (the symbol $\perp\!\!\!\perp_G$ denotes d-separation;
see Definition 6.1),


(ii) the local Markov property with respect to the DAG G if each variable is
independent of its non-descendants given its parents, and


(iii) the Markov factorization property with respect to the DAG G if

    $p(\mathbf{x}) = p(x_1, \ldots, x_d) = \prod_{j=1}^{d} p(x_j \mid pa_j^G)$.

For this last property, we have to assume that $P_{\mathbf{X}}$ has a density p; the
factors in the product are referred to as causal Markov kernels describing the
conditional distributions $P_{X_j \mid PA_j^G}$.


It turns out that as long as the joint distribution has a density,^11 these three
definitions are equivalent.


Theorem 6.22 (Equivalence of Markov properties) If $P_{\mathbf{X}}$ has a density p, then
all Markov properties in Definition 6.21 are equivalent.


The proof can be found as Theorem 3.27 in Lauritzen [1996], for example. 


Example 6.23 A distribution $P_{X_1,X_2,X_3,X_4}$ is Markovian with respect to the graph G
shown in Figure 6.1 on page 84 if, according to (i) or (ii),

    $X_2 \perp\!\!\!\perp X_3 \mid X_1$   and   $X_1 \perp\!\!\!\perp X_4 \mid X_2, X_3$,

or, according to (iii),

    $p(x_1, x_2, x_3, x_4) = p(x_3)\, p(x_1 \mid x_3)\, p(x_2 \mid x_1)\, p(x_4 \mid x_2, x_3)$.

We will see later in Proposition 6.31 that a distribution entailed from an SCM is
Markovian with respect to the graph of the SCM. Therefore, these conditions are
indeed satisfied for a distribution $P_{X_1,X_2,X_3,X_4}$ entailed by the SCM as in Figure 6.1,
left. Intuitively, the statement $X_2 \perp\!\!\!\perp X_3 \mid X_1$ is reasonable. Considering the path
$X_2 \leftarrow X_1 \leftarrow X_3$, we have that $X_3$ does not provide any new information about $X_2$
if we already know $X_1$. In this sense, the graph structure of an SCM leaves some
“traces” in the joint distribution.


^11 In this book, we always consider densities with respect to a product measure.


The Markov condition relates statements about graph separation to conditional 
independences. It is possible, however, that different graphs encode the exact same 
set of conditional independences. 


Definition 6.24 (Markov equivalence of graphs) We denote by M(G) the set of 
distributions that are Markovian with respect to G: 


M(G) := {P:P satisfies the global (or local) Markov property with respect to G}. 


Two DAGs $G_1$ and $G_2$ are Markov equivalent if $M(G_1) = M(G_2)$. This is the case
if and only if $G_1$ and $G_2$ satisfy the same set of d-separations, which means the
Markov condition entails the same set of (conditional) independence conditions.

The set of all DAGs that are Markov equivalent to some DAG G is called the Markov
equivalence class of G. It can be represented by a completed PDAG that is denoted
by CPDAG(G) = $(V, \mathcal{E})$; it contains the (directed) edge $(i, j) \in \mathcal{E}$ if and only if one
member of the Markov equivalence class does; see Figure 6.4.


From this definition, determining whether two DAGs are Markov equivalent
appears to be a non-trivial problem. Fortunately, Verma and Pearl [1991] provide a concise
characterization; see also Frydenberg [1990].


Lemma 6.25 (Graphical criteria for Markov equivalence) Two DAGs $G_1$ and $G_2$
are Markov equivalent if and only if they have the same skeleton and the same
immoralities.


Here, three nodes A, B, and C in a DAG form an immorality or v-structure if
A → B ← C and A and C are not directly connected (see Section 6.1).
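
The criterion of Lemma 6.25 translates directly into a few lines of code. The sketch below represents a DAG by its list of directed edges (an assumption of this illustration; isolated nodes are not represented, and the inputs are not checked for acyclicity) and compares skeletons and immoralities. The three-node graphs at the end are standard textbook cases and are not the graphs of Figure 6.4.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of the edge set."""
    return {frozenset(edge) for edge in edges}

def immoralities(edges):
    """All v-structures A -> B <- C with A and C non-adjacent, stored as ({A, C}, B)."""
    skel = skeleton(edges)
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    found = set()
    for b, pa in parents.items():
        for a, c in combinations(sorted(pa), 2):
            if frozenset((a, c)) not in skel:
                found.add((frozenset((a, c)), b))
    return found

def markov_equivalent(edges1, edges2):
    """Lemma 6.25: same skeleton and same immoralities."""
    return (skeleton(edges1) == skeleton(edges2)
            and immoralities(edges1) == immoralities(edges2))

chain = [("X", "Y"), ("Y", "Z")]             # X -> Y -> Z
reversed_chain = [("Z", "Y"), ("Y", "X")]    # X <- Y <- Z
fork = [("Y", "X"), ("Y", "Z")]              # X <- Y -> Z
collider = [("X", "Y"), ("Z", "Y")]          # X -> Y <- Z
```

The chain, the reversed chain, and the fork are pairwise Markov equivalent, whereas the collider shares their skeleton but has an immorality and therefore forms its own equivalence class.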

Figure 6.4 shows an example of two Markov equivalent graphs (center and left).
The graphs share the same skeleton and both of them have only one immorality:
X → Z ← V. In the corresponding CPDAG (see Figure 6.4, right), not all directed
edges are part of an immorality. The edge Z → Y, for example, is required to avoid
a v-structure Y → Z ← V. Furthermore, X → Y prevents the existence of a directed
cycle.

Figure 6.4: Two Markov equivalent DAGs (left and center); these are the only two DAGs
in the corresponding Markov equivalence class that can be represented by the CPDAG on
the right-hand side.

We now introduce the graphical concept of a Markov blanket [Pearl, 1988] that 
becomes relevant when one tries to predict the value of a target variable Y from the 
observed values of all the other variables. One may then wonder what would be the 
smallest set of variables whose knowledge renders the remaining ones irrelevant for 
the prediction task. 


Definition 6.26 (Markov blanket) Consider a DAG $G = (V, \mathcal{E})$ and a target node
Y. The Markov blanket of Y is the smallest set M such that

    $Y \perp\!\!\!\perp_G V \setminus (\{Y\} \cup M)$ given M.

If $P_{\mathbf{X}}$ is Markovian with respect to G, then

    $Y \perp\!\!\!\perp V \setminus (\{Y\} \cup M)$ given M.


In other words, given M, the other variables do not provide any further informa- 
tion about Y. In an idealized regression setting, we thus only need to include the 
variables in M for predicting Y. This does not imply that in a finite sample setting,
the other variables are useless. If the dependence of Y on its Markov blanket
M is not well aligned with the prior or function class used by the given regression
method, adding variables outside M may improve the prediction of Y.

For DAGs, we know what the Markov blanket looks like. It contains not only the 
parents, but also children and parents of children [Pearl, 1988]. 


Proposition 6.27 (Markov blanket) Consider a DAG G and a target node Y.
Then, the Markov blanket M of Y includes its parents, its children, and the parents
of its children:

    $M = PA_Y \cup CH_Y \cup PA_{CH_Y}$.
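
Proposition 6.27 gives a purely graphical recipe, which the following sketch implements for a DAG given as a list of directed edges. The example graph and its node names are hypothetical. The target itself is removed from the result, since it is trivially a parent of its own children.

```python
def markov_blanket(edges, target):
    """Parents, children, and parents of children of `target` (Proposition 6.27)."""
    parents = {a for a, b in edges if b == target}
    children = {b for a, b in edges if a == target}
    parents_of_children = {a for a, b in edges if b in children}
    return (parents | children | parents_of_children) - {target}

# Hypothetical DAG: B -> A -> Y -> C <- S, and C -> D.
EDGES = [("B", "A"), ("A", "Y"), ("Y", "C"), ("S", "C"), ("C", "D")]
blanket_of_y = markov_blanket(EDGES, "Y")   # {'A', 'C', 'S'}
```

In this graph, the grandparent B and the grandchild D are outside the blanket of Y, while S enters it only because it shares the child C with Y.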
So far, we have discussed the Markov property as relating distributions and
graphs. Now, we would like to discuss some of its causal implications. The Markov 
property can be used to justify Reichenbach’s common cause principle (Princi- 
ple 1.1). Recall that it states that when the random variables X and Y are dependent, 
there must be a “causal explanation” for this dependence: 

(i) X is (possibly indirectly) causing Y, or 
(ii) Y is (possibly indirectly) causing X, or 
(iii) there is a (possibly unobserved) common cause Z that (possibly indirectly) 
causes both X and Y. 
Here, we have not further specified the meaning of the word “causing.” The fol- 
lowing proposition justifies Reichenbach’s principle with respect to a weak notion 
of “causing,” namely the existence of a directed path. 


Proposition 6.28 (Reichenbach’s common cause principle) Assume that any 
pair of variables X and Y can be embedded into a larger system in the following 
sense. There exists a correct SCM over the collection X of random variables that 
contains X and Y with graph G. Then Reichenbach’s common cause principle 
follows from the Markov property. If X and Y are (unconditionally) dependent, 
then there is 

(i) either a directed path from X to Y, or 


(ii) from Y to X, or 
(iii) there is a node Z with a directed path from Z to X and from Z to Y. 


Proof. Due to the Markov property, the dependence implies that G contains an
unblocked path between X and Y. This path cannot contain a collider, for
otherwise it would be blocked by the empty set. The statement follows since any path
between X and Y without collider must be of the form X → ... → Y, X ← ... ← Y,
or X ← ... ← Z → ... → Y.


Remark 6.29 (Selection bias) In Reichenbach’s principle, we start with two de- 
pendent random variables and obtain a valid statement. In real applications, how- 
ever, it might be that we have implicitly conditioned on a third variable (selection 
bias). As Example 6.30 shows, this may lead to a dependence between X and 
Y, although none of the three conditions hold (see also the discussion in the last 
paragraph of Section 1.3). 


Example 6.30 (Berkson’s paradox) The following example “Why are handsome
men such jerks?” is taken from Ellenberg [2014] and is an instance of Berkson’s
paradox [Berkson, 1946]. Let us assume that whether men are in a relationship
(R = 1) is determined only by whether they are handsome (H = 1) and whether they
are friendly (F = 1). More precisely, assume that the correct SCM has the form:

    $H := N_H$
    $F := N_F$
    $R := \min(H, F) \oplus N_R$,

where $N_H, N_F \overset{iid}{\sim} Ber(0.5)$ and $N_R \sim Ber(0.1)$. The symbol ⊕ denotes addition
modulo 2. In this model, a man is very likely to be in a relationship if he is handsome
and friendly. Otherwise, he is likely to be single. As we can see from the SCM,
H and F are assumed to be independent. If you consider men, however, that are
not in a relationship, that is, you condition on R = 0, the characteristics, whether a
man is friendly or handsome, become anti-correlated. If someone is handsome, he
is more likely to be unfriendly (otherwise he would be in a relationship). We have
that

    $F \not\!\perp\!\!\!\perp H \mid R = 0$

and therefore F is not independent of H given R.
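
The size of the induced dependence in Example 6.30 can be computed exactly from the stated SCM. The following sketch enumerates the noise values for F and $N_R$ and returns the conditional probability of being friendly given H and, optionally, given R; the function name is our own.

```python
from fractions import Fraction as F
from itertools import product

P_F = {0: F(1, 2), 1: F(1, 2)}      # F := N_F with N_F ~ Ber(0.5)
P_NR = {0: F(9, 10), 1: F(1, 10)}   # N_R ~ Ber(0.1)

def prob_friendly(handsome, relationship=None):
    """P(F = 1 | H = handsome) or, if given, P(F = 1 | H = handsome, R = relationship)."""
    numerator = denominator = F(0)
    for (friendly, p_f), (n_r, p_n) in product(P_F.items(), P_NR.items()):
        r = min(handsome, friendly) ^ n_r      # R := min(H, F) xor N_R
        if relationship is None or r == relationship:
            denominator += p_f * p_n
            numerator += p_f * p_n * friendly
    return numerator / denominator

unconditional = (prob_friendly(0), prob_friendly(1))          # (1/2, 1/2)
among_singles = (prob_friendly(0, 0), prob_friendly(1, 0))    # (1/2, 1/10)
```

Without conditioning, the probability of being friendly is 1/2 regardless of H. Among men who are not in a relationship, it stays 1/2 for H = 0 but drops to 1/10 for H = 1, which is the anti-correlation described in the example.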


As we have mentioned before, Pearl [2009] shows in Theorem 1.4.1 that the law 
Px induced by an SCM is Markovian with respect to its graph [see also Verma and 
Pearl, 1988]. 


Proposition 6.31 (SCMs imply Markov property) Assume that Px is induced by 
an SCM with graph G. Then, Px is Markovian with respect to G. 


The assumption that a distribution is Markovian with respect to the causal graph 
is sometimes called the causal Markov condition; this requires the notion of a 
causal graph. For us, causal graphs are induced by the underlying SCM. The con- 
cept of causal graphical models, on the other hand, uses them as a starting point 
for causal inference. 


6.5.2 Causal Graphical Models 


We will see in Section 6.6 that for defining intervention distributions, it usually
suffices to have knowledge of the observational distribution and the graph structure.
We therefore define a causal graphical model as a pair that consists of a graph and
an observational distribution such that the distribution is Markovian with respect to
the graph (causal Markov condition). There is a subtle technicality, however.
Formally, we need to have access to the full conditionals. If $p(x_2 \mid x_1 = 3)$ is not defined,
for example, because $p(x_1 = 3) = 0$, we may not be able to define $p^{do(X_1:=3)}(x_2)$.
This motivates the following definition:


Definition 6.32 (Causal graphical model) A causal graphical model over random
variables $\mathbf{X} = (X_1, \ldots, X_d)$ contains a graph G and a collection of functions
$f_j(x_j, x_{PA_j^{G}})$ that integrate to 1:

$$\int f_j(x_j, x_{PA_j^{G}}) \, dx_j = 1.$$

These functions induce a distribution $P_{\mathbf{X}}$ over $\mathbf{X}$ via

$$p(x_1, \ldots, x_d) = \prod_{j=1}^{d} f_j(x_j, x_{PA_j^{G}}),$$

and thus play the role of conditionals: $f_j(x_j, x_{PA_j^{G}}) = p(x_j \mid x_{PA_j^{G}})$. A causal graphical
model induces intervention distributions according to Equations (6.8) and (6.9)
in Section 6.6. In the most general form, we can define

$$p^{do(X_k := q(\cdot \mid x_{\widetilde{PA}_k}))}(x_1, \ldots, x_d) = \prod_{j \neq k} f_j(x_j, x_{PA_j^{G}}) \, q(x_k \mid x_{\widetilde{PA}_k})$$

with $q(\cdot \mid x_{\widetilde{PA}_k})$ integrating to 1 and the new parents not leading to a cycle.


If a distribution $P_{\mathbf{X}}$ over $\mathbf{X}$ is Markovian with respect to a graph G and allows for
a strictly positive, continuous density p, the pair $(P_{\mathbf{X}}, G)$ defines a causal graphical
model by $f_j(x_j, x_{PA_j^{G}}) := p(x_j \mid x_{PA_j^{G}})$.

Why do we primarily work with SCMs and not just with graphs and the Markov 
condition, that is, causal graphical models? Formally, SCMs contain strictly more 
information than their corresponding graph and law (e.g., counterfactual state- 
ments) and hence also more information than the family of all intervention dis- 
tributions together with the observational distribution. It is debatable, though, 
whether this additional information is useful. Maybe more importantly, restrict- 
ing the function class in SCMs can lead to identifiability of the causal structure 
(see Sections 4.1.3-4.1.6 and 7.1.2). Those assumptions are easier to phrase in the 
language of SCMs than in the language of graphical models.


6.5.3 Faithfulness and Causal Minimality


In the previous subsection, we discussed the Markov assumption, which enables us 
to read off independences from the graph structure. Faithfulness allows us to infer 
dependences from the graph structure. 


Definition 6.33 (Faithfulness and causal minimality) Consider a distribution
$P_{\mathbf{X}}$ and a DAG G.


(i) $P_{\mathbf{X}}$ is faithful to the DAG G if

$$A \perp\!\!\!\perp B \mid C \;\Rightarrow\; A \perp\!\!\!\perp_{G} B \mid C$$


for all disjoint vertex sets A,B,C. 


(ii) A distribution satisfies causal minimality with respect to G if it is Markovian 
with respect to G, but not to any proper subgraph of G. 


Part (i) posits an implication that is the opposite of the global Markov condition

$$A \perp\!\!\!\perp_{G} B \mid C \;\Rightarrow\; A \perp\!\!\!\perp B \mid C,$$

see Definition 6.21. Faithfulness is not very intuitive at first glance. We now give an
example of a distribution that is Markovian but not faithful with respect to a given
DAG $G_1$. This is achieved by making two paths cancel each other and creating an
independence that is not implied by the graph structure.


Example 6.34 (Violation of faithfulness) Consider the following figure. 


[Figure: three DAGs over X, Y, and Z, labeled $G_1$ (left), $G_2$, and $H$.]

We first look at a linear Gaussian SCM that corresponds to the left graph $G_1$.

$$X := N_X,$$
$$Y := aX + N_Y,$$
$$Z := bY + cX + N_Z,$$

with normally distributed noise variables $N_X \sim \mathcal{N}(0, \sigma_X^2)$, $N_Y \sim \mathcal{N}(0, \sigma_Y^2)$, and
$N_Z \sim \mathcal{N}(0, \sigma_Z^2)$ that are jointly independent. This is an example of a linear Gaussian
SCM with graph $G_1$ (see Definition 6.2). Now, if

$$a \cdot b + c = 0,$$ (6.6)

the distribution is not faithful with respect to $G_1$ since we obtain $X \perp\!\!\!\perp Z$, which is
not implied by the graph structure.¹² The reader can easily verify that there is an
SCM with DAG $G_2$ inducing the same distribution.
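
The cancellation can be illustrated numerically. The following sketch simulates the SCM above with hypothetical coefficients a = 2, b = 3, and c = −6 (so that a · b + c = 0) and unit noise variances; these values and the sample size are assumptions chosen for illustration.

```r
# Simulate the linear Gaussian SCM with coefficients tuned so that a*b + c = 0.
set.seed(1)
n  <- 100000
a  <- 2
b  <- 3
c0 <- -a * b            # plays the role of c; hypothetical choice

X <- rnorm(n)
Y <- a * X + rnorm(n)
Z <- b * Y + c0 * X + rnorm(n)

# X and Z are (marginally) uncorrelated although X is a parent of Z ...
cor_XZ <- cor(X, Z)

# ... but the edge X -> Z is still active: regressing Z on X and Y recovers c
coef_X <- unname(coef(lm(Z ~ X + Y))["X"])

print(c(cor_XZ = cor_XZ, coef_X = coef_X))
```

The sample correlation between X and Z is close to zero, while the regression coefficient of X given Y is close to c. This matches the statement that the independence stems from the tuned coefficients and not from the graph.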


To obtain the extra independence in the preceding example, we had to “tune” 
the coefficients such that the two paths cancel each other out in (6.6). Spirtes et al. 
[2000, Theorem 3.2] show for linear models that this happens with zero probability 
if we assume that the coefficients are drawn randomly from positive densities. 

The distribution from Example 6.34 is faithful with respect to $G_2$, but not with
respect to $G_1$. Nevertheless, for both models, causal minimality is satisfied if none
of the parameters vanishes. In other words, the distribution is not Markovian to any
proper subgraph of $G_1$ or $G_2$ since removing any edge would correspond to a new
(conditional) independence that does not hold in the distribution; note that $G_2$ is
not a proper subgraph of $G_1$. It is a proper subgraph of $H$, however, and therefore,
the distribution does not satisfy causal minimality with respect to $H$. In general,
causal minimality is weaker than faithfulness.


Proposition 6.35 (Faithfulness implies causal minimality) If $P_{\mathbf{X}}$ is faithful and
Markovian with respect to G, then causal minimality is satisfied.


Proof. The argument is as follows: If $P_{\mathbf{X}}$ is Markovian with respect to a proper
subgraph $\tilde{G}$ of G, there are two nodes that are directly connected in G but not in $\tilde{G}$.
Thus, they can be d-separated in $\tilde{G}$ but not in G (see Problem 6.62). The Markov
condition implies the corresponding conditional independence statement in $P_{\mathbf{X}}$, and
thus $P_{\mathbf{X}}$ cannot be faithful with respect to G.


The following formulation is equivalent to causal minimality and hopefully is of 
further help to understand the condition. A distribution is minimal with respect 
to G if and only if there is no node that is conditionally independent of any of its 
parents, given the remaining parents. In some sense, all the parents are “active.” 


¹² More precisely, it is not triangle-faithful [Zhang and Spirtes, 2008].


Proposition 6.36 (Equivalence of causal minimality) Consider the random vector
$\mathbf{X} = (X_1, \ldots, X_d)$ and assume that the joint distribution has a density with respect
to a product measure. Suppose that $P_{\mathbf{X}}$ is Markovian with respect to G. Then
$P_{\mathbf{X}}$ satisfies causal minimality with respect to G if and only if $\forall X_j \; \forall Y \in PA_j^{G}$ we
have that $X_j \not\!\perp\!\!\!\perp Y \mid PA_j^{G} \setminus \{Y\}$.


Proof. See Appendix C.6. 


We have seen that while faithfulness is a strong assumption that links condi- 
tional independence statements with causal semantics, causal minimality is a much 
weaker condition. Suppose we are given a causal graphical model, for example, in 
which causal minimality is violated. Then, one of the edges is “inactive” in the 
notion of Proposition 6.36. If we remove this edge, the two models do not need to 
be counterfactually or interventionally equivalent in the sense of Definition 6.47. 
They are interventionally equivalent, however, if all densities are strictly positive
(or if we only allow for interventions on $X_k$ that are supported on a subset of the
support of $X_k$); see Problem 6.58. Then, causal minimality could be interpreted as
the convention to avoid redundancies in the description of an interventional model. 
In most model classes, identifiability from observational data is impossible to ob- 
tain without causal minimality. We cannot distinguish between Y := f(X) + Ny 
and Y := c + Ny, for example, if f is allowed to differ from c only outside the 
support of X; see also Remark 6.6 and Proposition 6.49. 


6.6 Calculating Intervention Distributions by Covariate 
Adjustment 


In this section we will make use of a somewhat trivial but very powerful invariance 
statement. Given an SCM $\mathfrak{C}$, and writing $pa(j) := PA_j^{G}$, we have

$$p^{\mathfrak{C}}(x_j \mid x_{pa(j)}) = p^{\tilde{\mathfrak{C}}}(x_j \mid x_{pa(j)})$$ (6.7)

for any SCM $\tilde{\mathfrak{C}}$ that is constructed from $\mathfrak{C}$ by intervening on (some) $X_k$ but not
on $X_j$. Equation (6.7) shows that causal relationships are autonomous under
interventions; this property is therefore sometimes called “autonomy.” If we intervene
on a variable, then the other mechanisms remain invariant (see the left box in Fig- 
ure 2.2). 

We deduce a formula from (6.7) that became known under three different names: 
truncated factorization [Pearl, 1993], G-computation formula [Robins, 1986],
and manipulation theorem [Spirtes et al., 2000]. Its importance stems from the
fact that it allows us to compute statements about intervention distributions even
though we have never seen data from them.

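Consider an SCM $\mathfrak{C}$ with structural assignments

$$X_j := f_j(X_{pa(j)}, N_j), \qquad j = 1, \ldots, d,$$

and density $p^{\mathfrak{C}}$. Because of the Markov property, we have¹³

$$p^{\mathfrak{C}}(x_1, \ldots, x_d) = \prod_{j=1}^{d} p^{\mathfrak{C}}(x_j \mid x_{pa(j)}).$$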


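Now consider the SCM $\tilde{\mathfrak{C}}$ that evolves from $\mathfrak{C}$ after $do(X_k := \tilde{N}_k)$, where $\tilde{N}_k$ allows
for the density $\tilde{p}$. Again, it follows from the Markov assumption that

$$p^{\mathfrak{C}; do(X_k := \tilde{N}_k)}(x_1, \ldots, x_d) = \prod_{j \neq k} p^{\mathfrak{C}; do(X_k := \tilde{N}_k)}(x_j \mid x_{pa(j)}) \cdot p^{\mathfrak{C}; do(X_k := \tilde{N}_k)}(x_k)$$
$$= \prod_{j \neq k} p^{\mathfrak{C}}(x_j \mid x_{pa(j)}) \, \tilde{p}(x_k).$$ (6.8)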
In the last step, we make use of the powerful invariance (6.7). Equation (6.8) al- 
lows us to compute an interventional statement (left-hand side) from observational 
quantities (right-hand side). As a special case, we obtain 


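$$p^{\mathfrak{C}; do(X_k := a)}(x_1, \ldots, x_d) = \begin{cases} \prod_{j \neq k} p^{\mathfrak{C}}(x_j \mid x_{pa(j)}) & \text{if } x_k = a, \\ 0 & \text{otherwise.} \end{cases}$$ (6.9)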


Usually, conditioning and intervening with do() are different operations (see the
discussion after Example 6.10). We are now able to show that these operations
become identical for variables that do not have any parents. Without loss of
generality, let us assume that $X_1$ is such a source node. We then have

$$p^{\mathfrak{C}}(x_2, \ldots, x_d \mid x_1 = a) = \frac{p(x_1 = a) \prod_{j=2}^{d} p^{\mathfrak{C}}(x_j \mid x_{pa(j)})}{p(x_1 = a)} = p^{\mathfrak{C}; do(X_1 := a)}(x_2, \ldots, x_d).$$ (6.10)

Equations (6.8) and (6.9) are widely applicable but sometimes a bit cumbersome 
to use. We will now learn about some practical alternatives. Therefore, we first 
recall Example 6.16 (kidney stones) that we will then be able to generalize. 


¹³ Note that the conditionals $p^{\mathfrak{C}}(x_j \mid x_{pa(j)})$ can be defined even for values $x_{pa(j)}$ such that $p^{\mathfrak{C}}(x_{pa(j)}) = 0$.




Example 6.37 (Kidney stones, continued) Assume that the true underlying SCM 
allows for the graph 


[Figure: DAG over Z, T, and R with edges Z → T, Z → R, and T → R.]

Here, Z is the size of the stone, T the treatment, and R the recovery (all binary).
We see that the recovery is influenced by the treatment and the size of the stone.
The treatment itself depends on the size, too. A large proportion of difficult cases
was assigned to treatment A. Consider further the two SCMs $\mathfrak{C}_A$ and $\mathfrak{C}_B$ that we
obtain after replacing the structural assignment for T with T := A and T := B,
respectively. Let us call the corresponding resulting probability distributions $P^{\mathfrak{C}_A}$
and $P^{\mathfrak{C}_B}$. Given that we are diagnosed with a kidney stone without knowing its size,
we should base our choice of treatment on a comparison between

$$P^{\mathfrak{C}_A}(R = 1) = P^{\mathfrak{C}; do(T := A)}(R = 1)$$

and

$$P^{\mathfrak{C}_B}(R = 1) = P^{\mathfrak{C}; do(T := B)}(R = 1).$$

Given that we have observed data from $\mathfrak{C}$, how can we estimate these quantities?
Consider the following computation:

$$P^{\mathfrak{C}_A}(R = 1) = \sum_{z=0}^{1} P^{\mathfrak{C}_A}(R = 1, T = A, Z = z)$$
$$= \sum_{z=0}^{1} P^{\mathfrak{C}_A}(R = 1 \mid T = A, Z = z) \, P^{\mathfrak{C}_A}(T = A, Z = z)$$
$$= \sum_{z=0}^{1} P^{\mathfrak{C}_A}(R = 1 \mid T = A, Z = z) \, P^{\mathfrak{C}_A}(Z = z)$$
$$\overset{(6.7)}{=} \sum_{z=0}^{1} P^{\mathfrak{C}}(R = 1 \mid T = A, Z = z) \, P^{\mathfrak{C}}(Z = z).$$ (6.11)


The last step contains the key idea. Again, we have made use of the invariance (6.7).
We can estimate $P^{\mathfrak{C}_A}(R = 1)$ from the empirical data shown in Table 6.1
and obtain

$$P^{\mathfrak{C}_A}(R = 1) \approx 0.93 \cdot \frac{357}{700} + 0.73 \cdot \frac{343}{700} = 0.832.$$

Analogously, we obtain

$$P^{\mathfrak{C}_B}(R = 1) \approx 0.87 \cdot \frac{357}{700} + 0.69 \cdot \frac{343}{700} \approx 0.782,$$


and we conclude that we would rather go for treatment A. (As stated before, we 
ignore the question of statistical significance, which seems justified if we need to 
decide between A and B.) The quantity 


$$P^{\mathfrak{C}_A}(R = 1) - P^{\mathfrak{C}_B}(R = 1) \approx 0.832 - 0.782$$ (6.12)


is sometimes called the average causal effect (ACE) for binary treatments. It is 
important to realize that this is different from simple conditioning: 


$$P^{\mathfrak{C}}(R = 1 \mid T = A) - P^{\mathfrak{C}}(R = 1 \mid T = B) = 0.78 - 0.83,$$


which, in this example, has even the opposite sign of the ACE. 
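
The comparison between adjusting and conditioning can be reproduced in a few lines. The sketch below assumes the commonly cited kidney stone counts (recovered/treated: small stones 81/87 for A and 234/270 for B; large stones 192/263 for A and 55/80 for B), which are consistent with the rounded values used above; Table 6.1 itself is not reproduced in this section, so these counts should be read as an assumption.

```r
# Recovery counts and group sizes; rows = stone size, columns = treatment.
# Counts are assumed to match Table 6.1.
recovered <- matrix(c(81, 192, 234, 55), nrow = 2,
                    dimnames = list(size = c("small", "large"),
                                    treatment = c("A", "B")))
treated   <- matrix(c(87, 263, 270, 80), nrow = 2,
                    dimnames = dimnames(recovered))

p_rec_given_tz <- recovered / treated        # P(R = 1 | T = t, Z = z)
p_z <- rowSums(treated) / sum(treated)       # P(Z = z): 357/700 and 343/700

# adjustment formula (6.11): sum_z P(R = 1 | T = t, Z = z) P(Z = z)
p_do <- colSums(p_rec_given_tz * p_z)

# simple conditioning: P(R = 1 | T = t)
p_cond <- colSums(recovered) / colSums(treated)

ace <- p_do[["A"]] - p_do[["B"]]             # average causal effect (6.12)
print(round(rbind(adjusted = p_do, conditional = p_cond), 3))
```

With the unrounded counts, the adjusted recovery probabilities are about 0.833 for A and 0.779 for B (the values 0.832 and 0.782 in the text use the rounded conditional probabilities), whereas plain conditioning gives 0.78 and 0.83 and thus reverses the ranking.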


This three-node example nicely highlights the difference between intervening 
and conditioning. In terms of densities, it reads: 


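$$p^{\mathfrak{C}; do(T := t)}(r) = \sum_{z} p^{\mathfrak{C}}(r \mid z, t) \, p^{\mathfrak{C}}(z) \;\neq\; \sum_{z} p^{\mathfrak{C}}(r \mid z, t) \, p^{\mathfrak{C}}(z \mid t) = p^{\mathfrak{C}}(r \mid t).$$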


Equation (6.11) is called “adjusting” for the variable Z. It denotes an important 
concept that is often used in practice and that we formally define in Definition 6.38. 
It once more allows us to compute intervention statements from observed quantities.
Note that the derivation of the adjustment formula (6.11) is sometimes based
on the truncated factorization (6.9), but we will see in Proposition 6.41 that the
alternative computation using the invariance (6.7) nicely carries over to more
complicated settings.


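Definition 6.38 (Valid adjustment set) Consider an SCM $\mathfrak{C}$ over nodes $\mathbf{V}$ and let
$Y \notin PA_X$ (otherwise we have $p^{\mathfrak{C}; do(X := x)}(y) = p^{\mathfrak{C}}(y)$). We call a set $\mathbf{Z} \subseteq \mathbf{V} \setminus \{X, Y\}$
a valid adjustment set for the ordered pair (X, Y) if

$$p^{\mathfrak{C}; do(X := x)}(y) = \sum_{\mathbf{z}} p^{\mathfrak{C}}(y \mid x, \mathbf{z}) \, p^{\mathfrak{C}}(\mathbf{z}).$$ (6.13)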


Here, the sum (could also be an integral) is over the range of Z, that is, over all 
values z that Z can take. 




In Example 6.37, $\mathbf{Z} = \{Z\}$ is a valid adjustment set for (T, R). Adjusting for
Z was necessary to compute the average causal effect. We have seen that simple 
conditioning led to false conclusions. In other words, the empty set was not a 
valid adjustment set. In such a case, we say that the causal effect from T to R is 
confounded. 


Definition 6.39 (Confounding) Consider an SCM $\mathfrak{C}$ over nodes $\mathbf{V}$ with a directed
path from X to Y, $X, Y \in \mathbf{V}$. The causal effect from X to Y is called confounded if

$$p^{\mathfrak{C}; do(X := x)}(y) \neq p^{\mathfrak{C}}(y \mid x).$$ (6.14)

Otherwise, the causal effect is called “unconfounded.”


It is sometimes believed that one should make the adjustment set as large as 
possible to reduce the influence of potential confounders. This is, however, not 
always a good idea as demonstrated by Berkson’s paradox [Berkson, 1946] in Ex- 
ample 6.30. It shows that not all sets are valid adjustment sets and that sometimes 
it is better to not include a covariate in the adjustment set. Let us try to investigate 
which sets we can use for adjusting. We use the same idea as in Example 6.37 and 
write (for any set Z) 


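$$p^{\mathfrak{C}; do(X := x)}(y) = \sum_{\mathbf{z}} p^{\mathfrak{C}; do(X := x)}(y, \mathbf{z}) = \sum_{\mathbf{z}} p^{\mathfrak{C}; do(X := x)}(y \mid x, \mathbf{z}) \, p^{\mathfrak{C}; do(X := x)}(\mathbf{z}).$$

If we have

$$p^{\mathfrak{C}; do(X := x)}(y \mid x, \mathbf{z}) = p^{\mathfrak{C}}(y \mid x, \mathbf{z}) \quad \text{and} \quad p^{\mathfrak{C}; do(X := x)}(\mathbf{z}) = p^{\mathfrak{C}}(\mathbf{z}),$$ (6.15)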


it follows (as before) that Z is a valid adjustment set. Property (6.15) states that 
the conditionals remain the same even after intervening on X; we say that they 
are invariant. We thus need to address the question of which conditionals remain 
invariant under the intervention do (X := x). 


Remark 6.40 (Characterization of invariant conditionals) Consider an SCM $\mathfrak{C}$
with structural assignments

$$X_j := f_j(PA_j, N_j)$$

and an intervention $do(X_k := x_k)$. Analogously to what is done in Pearl [2009,
Chapter 3.2.2], for example, we can now construct a new SCM $\mathfrak{C}^*$ that equals $\mathfrak{C}$
but has one more variable I that indicates whether the intervention took place or
not (see also the paragraph “Intervention Variables” in Section 6.3 on page 95).
More precisely, I is a parent of $X_k$ and does not have any other neighbors. The
corresponding structural assignments are

$$I := N_I,$$
$$X_j := f_j(PA_j, N_j) \quad \text{for } j \neq k,$$
$$X_k := \begin{cases} f_k(PA_k, N_k) & \text{if } I = 0, \\ x_k & \text{otherwise,} \end{cases}$$

where $N_I$ has a Bernoulli distribution with P(I = 0) = P(I = 1) = 0.5, for example
(other distributions work, too). Thus, I = 0 corresponds to the observational setting
and I = 1 to the interventional setting. More precisely, using Equation (6.10), we
obtain

$$p^{\mathfrak{C}^*}(x_1, \ldots, x_d \mid I = 0) = p^{\mathfrak{C}^*; do(I := 0)}(x_1, \ldots, x_d) = p^{\mathfrak{C}}(x_1, \ldots, x_d)$$

and similarly

$$p^{\mathfrak{C}^*}(x_1, \ldots, x_d \mid I = 1) = p^{\mathfrak{C}; do(X_k := x_k)}(x_1, \ldots, x_d).$$ (6.16)

Using the Markov condition for $\mathfrak{C}^*$, it thus follows for variables A and a set of
variables $\mathbf{B}$ that

$$A \perp\!\!\!\perp_{G^*} I \mid \mathbf{B} \;\Longrightarrow\; p^{\mathfrak{C}^*}(a \mid \mathbf{b}, I = 0) = p^{\mathfrak{C}^*}(a \mid \mathbf{b}, I = 1)$$
$$\Longrightarrow\; p^{\mathfrak{C}}(a \mid \mathbf{b}) = p^{\mathfrak{C}; do(X_k := x_k)}(a \mid \mathbf{b}).$$

The right-hand side states that the distribution $P_{A \mid \mathbf{B}}$ of the conditional A given $\mathbf{B}$
remains invariant under an intervention on $X_k$.


We are now able to continue the argument from before. Equation (6.15) is satis- 
fied for sets Z, for which we have 


$$Y \perp\!\!\!\perp_{G^*} I \mid X, \mathbf{Z} \quad \text{and} \quad \mathbf{Z} \perp\!\!\!\perp_{G^*} I.$$ (6.17)


The subscript G* means that the d-separation statement is required to hold in G*. 
Our deliberation immediately implies the first two statements of the following 
proposition:


Figure 6.5: Only the path X ← A → K → Y is a “backdoor path” from X to Y. The set
$\mathbf{Z} = \{K\}$ satisfies the backdoor criterion (see Proposition 6.41 (ii)); but $\mathbf{Z} = \{F, C, K\}$ is
also a valid adjustment set for (X, Y); see Proposition 6.41 (iii).


Proposition 6.41 (Valid adjustment sets) Consider an SCM over variables $\mathbf{X}$
with $X, Y \in \mathbf{X}$ and $Y \notin PA_X$. Then, the following three statements are true.


(i) “parent adjustment”:
$$\mathbf{Z} := PA_X$$
is a valid adjustment set for (X, Y).
(ii) “backdoor criterion”: Any $\mathbf{Z} \subseteq \mathbf{X} \setminus \{X, Y\}$ with

• $\mathbf{Z}$ contains no descendant of X AND
• $\mathbf{Z}$ blocks all paths from X to Y entering X through the backdoor
(X ← ..., see Figure 6.5)

is a valid adjustment set for (X, Y).
(iii) “toward necessity”: Any $\mathbf{Z} \subseteq \mathbf{X} \setminus \{X, Y\}$ with

• $\mathbf{Z}$ contains no descendant of any node on a directed path from
X to Y (except for descendants of X that are not on a directed
path from X to Y) AND
• $\mathbf{Z}$ blocks all non-directed paths from X to Y

is a valid adjustment set for (X, Y).


Only the third statement [Shpitser et al., 2010, Perkovic et al., 2015] requires
some explanation. Let us start with a valid adjustment set $\mathbf{Z}$, for example,
obtained via the backdoor criterion. We can then add any node $Z_0$ to $\mathbf{Z}$ that satisfies
$Z_0 \perp\!\!\!\perp Y \mid X, \mathbf{Z}$ because then

$$\sum_{\mathbf{z}, z_0} p(y \mid x, \mathbf{z}, z_0) \, p(\mathbf{z}, z_0) = \sum_{\mathbf{z}} p(y \mid x, \mathbf{z}) \sum_{z_0} p(\mathbf{z}, z_0) = \sum_{\mathbf{z}} p(y \mid x, \mathbf{z}) \, p(\mathbf{z}).$$

In fact, Proposition 6.41 (iii) characterizes all valid adjustment sets [Shpitser et al., 
2010]. 


Example 6.42 (Adjustment in linear Gaussian systems) Consider an SCM $\mathfrak{C}$
over variables $\mathbf{V}$ with $\{X, Y\}, \mathbf{Z} \subseteq \mathbf{V}$. Sometimes, we want to summarize a causal
effect from X to Y by a single real number instead of looking at $p^{\mathfrak{C}; do(X := x)}(y)$
for all x. We have seen an example in the case of binary treatments X (see
Equation (6.12)). But what can be done in the case of continuous random variables? As
a first approximation we may look at the expectation of this distribution and then
take the derivative with respect to x:

$$\frac{\partial}{\partial x} \mathbb{E}^{\mathfrak{C}; do(X := x)}[Y].$$ (6.18)


In general, this is still a function of x. In linear Gaussian systems, however, this 
function turns out to be constant. Assume that Z is a valid adjustment set for (X,Y). 
If V has a Gaussian distribution, then the conditional Y |X = x,Z =z follows a 
Gaussian distribution, too; its mean is 


$$\mathbb{E}[Y \mid X = x, \mathbf{Z} = \mathbf{z}] = a x + \mathbf{b}^{\top} \mathbf{z}$$ (6.19)


for some a and b. It follows from (6.13) (see Problem 6.63) that 


$$\frac{\partial}{\partial x} \mathbb{E}^{\mathfrak{C}; do(X := x)}[Y] = a.$$ (6.20)


It is possible to obtain the value of a in (6.19) in two different ways. (1) One can 
use the method of path coefficients: if there is exactly one directed path from X to 
Y, then a equals the product of the path coefficients. If there is no directed path, 
then a = 0 and if there are different paths, a can be computed using Wright’s for- 
mula [Wright, 1934]. (2) One can directly compute the conditional mean (6.19). 
If we are not given the joint distribution but rather a sample from it, we can estimate
(6.20) by regressing Y on X and $\mathbf{Z}$ and then reading off the regression coefficient
for X (see also Code Snippet 6.43).


Code Snippet 6.43 The following code generates an i.i.d. sample of size n =
100 from an SCM with the structure shown in Figure 6.5 (see the code for the
coefficients). Since we know the underlying SCM, the true value of quantity (6.20)
can be obtained by multiplying the path coefficients of the path X → D → Y; in
our example, it equals (−2) · (−1) = 2 (see lines 8 and 10 in the code). We can
now pretend that the precise form of the structural assignments (that is, the set of
coefficients) is unknown, but that we are given the data sample and the graph structure
of the SCM (see Figure 6.5) instead. We can then estimate the value (6.20) by
regressing Y on X and an adjustment set $\mathbf{Z}$. If $\mathbf{Z}$ is a valid adjustment set, we
obtain an unbiased estimator. In the code, the adjustment set $\mathbf{Z} = \emptyset$ leads to a
biased estimator (see line 15); only the adjustment sets $\mathbf{Z} = \{K\}$ and $\mathbf{Z} = \{F, C, K\}$
are valid (see lines 19 and 23, respectively).


 1  # generate a sample from the distribution entailed by the SCM
 2  set.seed(1); n <- 100
 3  C <- rnorm(n)
 4  A <- 0.8*rnorm(n)
 5  K <- A + 0.1*rnorm(n)
 6  X <- C - 2*A + 0.2*rnorm(n)
 7  F <- 3*X + 0.8*rnorm(n)
 8  D <- -2*X + 0.5*rnorm(n)
 9  G <- D + 0.5*rnorm(n)
10  Y <- 2*K - D + 0.2*rnorm(n)
11  H <- 0.5*Y + 0.1*rnorm(n)
12
13  # regress Y on X using different adjustment sets
14  #
15  lm(Y~X)$coefficients
16  # (Intercept)           X
17  #  0.09724282  1.27941073
18  #
19  lm(Y~X+K)$coefficients
20  # (Intercept)           X           K
21  #  0.01428974  2.07038809  2.16964827
22  #
23  lm(Y~X+F+C+K)$coefficients
24  # (Intercept)           X           F           C           K
25  #  0.01687018  1.90495456  0.05901385 -0.02260164  2.18276488


We now briefly comment on propensity score matching [Rosenbaum and Rubin, 
1983]. The following remark repeats the argument given by Pearl [2009, 11.3.5]. 


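Remark 6.44 (Propensity score matching) Consider an SCM over variables $\mathbf{X}$ =
$(X, Y, \mathbf{Z})$, with $\mathbf{Z} = (Z_1, Z_2, Z_3)$ and the following graph.

[Figure: DAG over $Z_1$, $Z_2$, $Z_3$, X, and Y, in which $Z_1$, $Z_2$, and $Z_3$ are the parents of X.]

One can see that the set $\{Z_1, Z_2, Z_3\}$ is a valid adjustment set, for example, by
parent adjustment (see Proposition 6.41). That is,

$$p^{\mathfrak{C}; do(X := x)}(y) = \sum_{z_1, z_2, z_3} p^{\mathfrak{C}}(y \mid x, z_1, z_2, z_3) \, p^{\mathfrak{C}}(z_1, z_2, z_3).$$ (6.21)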


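Sometimes, however, the value of X does not depend on $\mathbf{Z}$ “directly” but only
through a (real-valued) propensity score $L := L(\mathbf{Z}) = L(Z_1, Z_2, Z_3)$. This means
“$X \perp\!\!\!\perp \mathbf{Z} \mid L(\mathbf{Z})$,” or, more formally, we have for all $\mathbf{z}$, x and $\ell = L(\mathbf{z})$ that

$$p(\mathbf{z} \mid \ell, x) = p(\mathbf{z} \mid \ell).$$

If X is a binary choice that indicates treatment or no treatment, one may choose
$L(\mathbf{z}) := p(X = 1 \mid \mathbf{Z} = \mathbf{z})$, for example. But then, it follows from (6.21) that

$$p^{\mathfrak{C}; do(X := x)}(y) = \sum_{\mathbf{z}} p^{\mathfrak{C}}(y \mid x, \mathbf{z}) \, p^{\mathfrak{C}}(\mathbf{z}) = \sum_{\ell} \sum_{\mathbf{z}} p^{\mathfrak{C}}(y \mid x, \mathbf{z}) \, p^{\mathfrak{C}}(\ell) \, p^{\mathfrak{C}}(\mathbf{z} \mid \ell)$$
$$= \sum_{\ell, \mathbf{z}} p^{\mathfrak{C}}(y \mid \ell, x, \mathbf{z}) \, p^{\mathfrak{C}}(\ell) \, p^{\mathfrak{C}}(\mathbf{z} \mid \ell, x)$$
$$= \sum_{\ell} p^{\mathfrak{C}}(y \mid \ell, x) \, p^{\mathfrak{C}}(\ell).$$ (6.22)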


In the population setting, both computations (6.21) and (6.22) of the intervention
distribution are correct. The point is, however, that for finite data, (6.22) may lead
to a better estimate than (6.21) would: although one needs to estimate the function L,
the resulting conditional $p^{\mathfrak{C}}(y \mid x, \ell)$ is potentially lower dimensional than
$p^{\mathfrak{C}}(y \mid x, \mathbf{z})$. In practice, one often matches realizations with a “similar” value of $\ell$
to compute (6.22). Important practical details include the estimation of the function L
and the matching procedure. The idea works for any number of covariates.

In this sense, propensity score matching can be a nice and useful trick to gain 
statistical performance. It is irrelevant for population considerations. 
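
The population-level agreement of (6.21) and (6.22) can be verified on a small discrete model. The sketch below uses three binary covariates, a hypothetical propensity score that depends on z only through z1 + z2 + z3, and a made-up logistic outcome model; all numbers and function names are assumptions for illustration.

```r
# All 8 configurations of (Z1, Z2, Z3) with a hypothetical distribution p(z).
z_grid <- expand.grid(Z1 = 0:1, Z2 = 0:1, Z3 = 0:1)
p_z <- c(0.10, 0.05, 0.15, 0.20, 0.05, 0.10, 0.25, 0.10)

# Hypothetical propensity score L(z) = P(X = 1 | Z = z); it depends on z
# only through z1 + z2 + z3, so several z share the same score.
ell <- c(0.2, 0.4, 0.6, 0.8)[rowSums(z_grid) + 1]

# Hypothetical outcome model P(Y = 1 | X = x, Z = z).
p_y1 <- function(x) {
  plogis(-1 + 1.5 * x + 0.8 * z_grid$Z1 - 0.5 * z_grid$Z2 + 0.3 * z_grid$Z3)
}

# (6.21): adjust for the full covariate vector z
adjust_on_z <- function(x) sum(p_y1(x) * p_z)

# (6.22): adjust for the scalar score l = L(z) only
adjust_on_score <- function(x) {
  p_x_given_z <- if (x == 1) ell else 1 - ell
  p_xz <- p_x_given_z * p_z                           # p(x, z)
  # p(y = 1 | x, l), computed from the joint distribution
  p_y1_given_x_ell <- tapply(p_y1(x) * p_xz, ell, sum) / tapply(p_xz, ell, sum)
  p_ell <- tapply(p_z, ell, sum)                      # p(l)
  sum(p_y1_given_x_ell * p_ell)
}

print(c(via_z = adjust_on_z(1), via_score = adjust_on_score(1)))
```

Both functions return the same value for each x, in line with the remark that the propensity score matters for estimation from finite samples rather than for population quantities.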


6.7 Do-Calculus 


Again, consider an SCM over variables $\mathbf{V}$. Sometimes, we can compute intervention
distributions $p^{\mathfrak{C}; do(X := x)}$ in other ways than the adjustment formula (6.13). Let
us therefore call an intervention distribution $p^{\mathfrak{C}; do(X := x)}(y)$ identifiable if it can be
computed from the observational distribution and the graph structure. If there is a
valid adjustment set for (X, Y), for example, $p^{\mathfrak{C}; do(X := x)}(y)$ is certainly identifiable.
Pearl [2009, Theorem 3.4.1] has developed the so-called do-calculus that consists
of three rules. Given a graph G and disjoint subsets $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{Z}$, and $\mathbf{W}$, we have


1. “Insertion/deletion of observations”: 


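$$p^{\mathfrak{C}; do(\mathbf{X} := \mathbf{x})}(\mathbf{y} \mid \mathbf{z}, \mathbf{w}) = p^{\mathfrak{C}; do(\mathbf{X} := \mathbf{x})}(\mathbf{y} \mid \mathbf{w})$$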


if Y and Z are d-separated by X, W in a graph where incoming edges in X 
have been removed. 


2. “Action/observation exchange”: 
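$$p^{\mathfrak{C}; do(\mathbf{X} := \mathbf{x}, \mathbf{Z} := \mathbf{z})}(\mathbf{y} \mid \mathbf{w}) = p^{\mathfrak{C}; do(\mathbf{X} := \mathbf{x})}(\mathbf{y} \mid \mathbf{z}, \mathbf{w})$$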


if Y and Z are d-separated by X, W in a graph where incoming edges in X 
and outgoing edges from Z have been removed. 


3. “Insertion/deletion of actions”: 
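$$p^{\mathfrak{C}; do(\mathbf{X} := \mathbf{x}, \mathbf{Z} := \mathbf{z})}(\mathbf{y} \mid \mathbf{w}) = p^{\mathfrak{C}; do(\mathbf{X} := \mathbf{x})}(\mathbf{y} \mid \mathbf{w})$$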


if Y and Z are d-separated by X, W in a graph where incoming edges in X 
and Z(W) have been removed. Here, Z(W) is the subset of nodes in Z that 
are not ancestors of any node in W in a graph that is obtained from G after 
removing all edges into X. 


Theorem 6.45 (Do-calculus) The following statements hold. 


(i) The rules are complete; that is, all identifiable intervention distributions can 
be computed by an iterative application of these three rules [Huang and 
Valtorta, 2006, Shpitser and Pearl, 2006]. 

(ii) In fact, there is an algorithm, proposed by Tian [2002] that is guaranteed 
[Huang and Valtorta, 2006, Shpitser and Pearl, 2006] to find all identifiable 
intervention distributions. 

(iii) There is a necessary and sufficient graphical criterion for identifiability of 
intervention distributions [Shpitser and Pearl, 2006, Corollary 3], based on 
so-called hedges [see also Huang and Valtorta, 2006]. 


As a corollary of the do-calculus, we obtain the front-door adjustment (see
Problem 6.65).


Example 6.46 (Front-door adjustment) Let $\mathfrak{C}$ be an SCM with corresponding
graph

[Figure: DAG with edges X → Z, Z → Y, U → X, and U → Y.]

If we do not observe U, we cannot apply the backdoor criterion. In fact, there is no
valid adjustment set. But still, provided that $p^{\mathfrak{C}}(x, z) > 0$, the do-calculus provides
us with

$$p^{\mathfrak{C}; do(X := x)}(y) = \sum_{z} p^{\mathfrak{C}}(z \mid x) \sum_{\tilde{x}} p^{\mathfrak{C}}(y \mid \tilde{x}, z) \, p^{\mathfrak{C}}(\tilde{x}).$$ (6.23)


The fact that observing Z in addition to X and Y here reveals causal information 
nicely shows that causal relations can also be explored by observing the “channel” 
(here Z) that carries the “signal” from X to Y. 
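
Formula (6.23) can be checked against the ground truth in a small binary model. The sketch below assumes the front-door graph X → Z → Y with a hidden confounder U of X and Y and uses made-up conditional probabilities; the helper names (`bern`, `front_door`, `truth`) are illustrative.

```r
# P(V = v) for a binary variable V with P(V = 1) = p1 (vectorized)
bern <- function(p1, v) v * p1 + (1 - v) * (1 - p1)

# Hypothetical parameters of the full SCM (U is hidden from the analyst).
p_u1    <- 0.4                                   # P(U = 1)
p_x1_u  <- c(0.2, 0.7)                           # P(X = 1 | U = 0), P(X = 1 | U = 1)
p_z1_x  <- c(0.3, 0.9)                           # P(Z = 1 | X = 0), P(Z = 1 | X = 1)
p_y1_zu <- matrix(c(0.1, 0.5, 0.4, 0.8), 2, 2)   # P(Y = 1 | Z = row - 1, U = col - 1)

# Joint distribution over (u, x, z, y) via the Markov factorization.
g <- expand.grid(u = 0:1, x = 0:1, z = 0:1, y = 0:1)
g$p <- bern(p_u1, g$u) * bern(p_x1_u[g$u + 1], g$x) *
  bern(p_z1_x[g$x + 1], g$z) * bern(p_y1_zu[cbind(g$z + 1, g$u + 1)], g$y)

# Observational law of (X, Z, Y) only: U is marginalized out.
obs <- aggregate(p ~ x + z + y, data = g, FUN = sum)
P <- function(sel) sum(obs$p[sel])

# Front-door formula (6.23), using observational quantities only.
front_door <- function(x, y = 1) {
  total <- 0
  for (z in 0:1) {
    p_z_given_x <- P(obs$z == z & obs$x == x) / P(obs$x == x)
    inner <- 0
    for (xt in 0:1) {
      p_y_given_xz <- P(obs$y == y & obs$x == xt & obs$z == z) /
        P(obs$x == xt & obs$z == z)
      inner <- inner + p_y_given_xz * P(obs$x == xt)
    }
    total <- total + p_z_given_x * inner
  }
  total
}

# Ground truth from the full SCM: truncated factorization (6.9),
# that is, the joint density divided by the factor p(x | u).
truth <- function(x, y = 1) {
  sel <- g$x == x & g$y == y
  sum(g$p[sel] / bern(p_x1_u[g$u[sel] + 1], x))
}

print(c(front_door = front_door(1), truth = truth(1)))
```

In this toy model both computations give 0.58 for x = 1, although `front_door` never uses the distribution of U.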


Bareinboim and Pearl [2014] consider the problem of transportability. They are 
also interested in intervention distributions, but they allow for the possibility to 
include knowledge (i.e., observational distributions and intervention distributions) 
that has been gained in SCMs that coincide with the target SCM in some structural 
assignments and differ in others. 


6.8 Equivalence and Falsifiability of Causal Models 


So far, SCMs have been mathematical objects. To link them to reality, we regard 
them as models for a data-generating process. It can be a complicated class of 
models, though. Instead of modeling “just” a joint distribution (as we can model 
a physical process with a Poisson process, for example), we can now model the 
system in an observational state and under perturbations at the same time. We 
have seen that it is even possible to regard SCMs as models for counterfactual 
statements. 

More formally, consider a vector $\mathbf{X} = (X_1, \ldots, X_d)$ of random variables. A
probabilistic model for $\mathbf{X}$ predicts an observational distribution $P_{\mathbf{X}}$. We call such a
model an interventional model if it additionally predicts intervention distributions
in which some variables $X_j$ have been set to (independent) variables $\tilde{N}_j$. Finally, a
counterfactual model additionally predicts the result of counterfactual statements. 
Traditional machine learning methods, for example, build probabilistic models;
causal graphical models (Definition 6.32) can be used as interventional models,
and SCMs can be used as counterfactual models. We call two models equivalent if
they agree on the corresponding predictions [see Bongers et al., 2016, for a similar
construction].


Definition 6.47 (Equivalence of causal models) Two models are called 
{probabilistically /interventionally / counterfactually} equivalent 


if they entail the same {obs. / obs. and int. / obs., int., and counterf. } distributions. 


It is apparent that the notion of interventional equivalence applies only to inter- 
ventional and counterfactual models, for example. Proposition 7.1 implies that for 
each probabilistic model, there is an observationally equivalent SCM. 

If $\mathbf{X}$ has a strictly positive density, Proposition 6.48 shows that we can restrict the
notion to interventions on single nodes, that is, interventions in which a variable $X_j$
has been set to a variable $\tilde{N}_j$ where the distribution of $\tilde{N}_j$ has full support. If two
models agree on this subclass of interventions, they agree on all other interventions,
too. The rationale is that interventions on single nodes correspond to the standard
version of randomized experiments.

For a given data-generating process, we can now falsify a probabilistic or in- 
terventional model if the corresponding distributions do not agree with the data 
observed from the process. That is, if an interventional model predicts the obser- 
vational distribution correctly but does not predict what happens in a randomized 
experiment, the model is still considered to be falsified. This notion includes the 
assumption that there is an agreement about what a randomized experiment should 
look like. One should be careful about writing down an SCM when it is unclear 
how to randomize over the involved variables in reality (or perform interventions 
on them). The notion of falsifiability further requires the concept of (statistical) 
significance, which is not discussed here. We do not include counterfactual mod- 
els, since they are hard to falsify in general. We could falsify them based on their 
implications on observational distributions and intervention distributions (see Sh- 
pitser and Pearl [2008a] and references therein). In some specific experimental 
setups, it is furthermore possible to construct counterfactual statements that are 
falsifiable (see Example 3.4). Example 6.19, however, shows two SCMs that entail 
the same observational and intervention distributions but entail different counter- 
factual statements. 

The above-mentioned restriction to a subclass of interventions (single variables 
are set to a noise variable) serves a practical purpose. To check the validity of
the model we have to compare the outcome of randomized experiments with the
model’s predictions. For more complex interventions, the corresponding experi- 
ments in reality seem more complicated to implement. The following proposition 
states that this comes without loss of generality: if causal models agree on all 
single-node interventions, they are interventionally equivalent. The proof can be 
found in Appendix C.7. 


Proposition 6.48 (Interventional equivalence) Assume that two SCMs (or causal
graphical models) $\mathfrak{C}_1$ and $\mathfrak{C}_2$ induce strictly positive, continuous conditional
densities $p(x_j \mid x_{pa(j)})$, where $pa(j) := PA_j^{G}$, and satisfy causal minimality. Assume
further that they entail the same intervention distributions, in which some variable
$X_j$ has been set to a variable $\tilde{N}_j$ with full support:

$$P_{\mathbf{X}}^{\mathfrak{C}_1; do(X_j := \tilde{N}_j)} = P_{\mathbf{X}}^{\mathfrak{C}_2; do(X_j := \tilde{N}_j)} \qquad \forall j, \; \forall \tilde{N}_j \text{ with full support.}$$

Then, $\mathfrak{C}_1$ and $\mathfrak{C}_2$ are interventionally equivalent; that is, they agree on any possible
intervention, including atomic interventions or interventions in which the set of
parents is altered (without creating a cycle).


If the density is not strictly positive, this is not necessarily the case. One may 
then have to consider simultaneous interventions on several nodes (e.g., double 
knockout gene experiments); see Problem 6.59. 

Furthermore, we are now able to justify the notion of structural minimality of 
SCMs (see Remark 6.6). We have argued that if the function in a structural assign- 
ment of an SCM does not depend on one of the inputs, we can choose a sparser 
representation. The following proposition formalizes in what sense these represen- 
tations are equivalent. 


Proposition 6.49 (Counterfactual equivalence) Consider two SCMs $\mathfrak{C}$ and $\mathfrak{C}^*$
that share the same noise distribution $P_{\mathbf{N}}$ and that differ only in the kth structural
assignment:

$$f_k(pa_k, n_k) = f_k^*(pa_k^*, n_k), \qquad \forall pa_k, \; \forall n_k \text{ with } p(n_k) > 0,$$ (6.24)

with $PA_k^* \subseteq PA_k$. Then, both SCMs are counterfactually equivalent.


The proof is provided in Appendix C.8. 


6.9 Potential Outcomes 


We now introduce an alternative approach to causal inference that is not based on 
SCMs. The framework is often referred to as potential outcomes or the Rubin
causal model and is widely used in the social sciences. The ideas date back to
Neyman [1923] and Fisher [1925] who mainly discussed randomized experiments. 
Rubin [1974] extended the ideas to observational studies. Rubin [2005], Morgan 
and Winship [2007], and Imbens and Rubin [2015] provide more elaborate intro- 
ductions into the topic. 


6.9.1 Definitions and Example 


To explain potential outcomes, we revisit Example 3.4 (the eye doctor) and
reformulate it in this framework. Rather than with random variables, we now start with
a group of n patients (or units) u = 1, ..., n, each of which may or may not receive
the treatment. We assign two potential outcomes to each patient u: $B_u(t = 1)$
indicates whether the patient would go blind (B = 1) or get cured (B = 0) if she
receives treatment (T = 1). Analogously, $B_u(t = 0)$ encodes what happens without
treatment (T = 0). Both of these potential outcomes are assumed to be deterministic.
For each patient the treatment either helps or it does not help: there is no
randomness involved. If $B_u(t = 1) = 0$ and $B_u(t = 0) = 1$, we say that the treatment
has a positive effect for unit u.

In practice, however, we are not able to check these conditions. The “fundamental
problem of causal inference” [Holland, 1986] states that for each unit u we can
observe either $B_u(t = 1)$ or $B_u(t = 0)$ and never both of them at the same time. The
reason is that after we have chosen to treat a person, we cannot go back in time and
undo the treatment. This even holds the other way around. If we decide to not give
a treatment, we can still apply the treatment later in time but this cannot be interpreted
as an outcome of the variable $B_u(t = 1)$ anymore. The patient might have
recovered in the meantime by herself, for example. Thus, we can observe only one
of the potential outcomes; the unobserved quantity becomes a counterfactual.

Table 6.2 shows a (hypothetical) data set for the previous example. In fact, the 
data points are sampled according to the model described in Example 3.4. To 
justify the presentation in Table 6.2, we often implicitly assume the stable unit 
treatment value assumption (SUTVA) [Rubin, 2005]. It states that the units do 
not interfere (e.g., the potential outcome of a unit does not depend on which treat- 
ment any other unit received) [Cox, 1958]; furthermore it requires that the potential 
outcomes do not depend on how or why the treatment has been received. We will 
see in Section 6.9.2 that SUTVA is satisfied when the data are generated from an 
SCM (as was done for this example). 

The potential outcomes tell us the effect of a treatment on an individual basis; we
define the unit-level causal effect as $B_u(t = 1) - B_u(t = 0)$ and an average causal
effect

$$CE = \frac{1}{n} \sum_{u=1}^{n} \big( B_u(t = 1) - B_u(t = 0) \big). \tag{6.25}$$


Unit | Treatment | Pot. Outcome | Pot. Outcome | Unit-Level Causal Effect
u    | T         | B_u(t = 0)   | B_u(t = 1)   | B_u(t = 1) - B_u(t = 0)
-----|-----------|--------------|--------------|-------------------------
1    | 1         | 1            | [0]          | -1
2    | 0         | [1]          | 0            | -1
3    | 1         | 1            | [0]          | -1
...  |           |              |              |
43   | 1         | 1            | [0]          | -1
44   | 0         | [0]          | 1            | 1
45   | 0         | [1]          | 0            | -1
...  |           |              |              |
119  | 1         | 1            | [0]          | -1
120  | 1         | 0            | [1]          | 1
121  | 0         | [1]          | 0            | -1
...  |           |              |              |
200  | 0         | [1]          | 0            | -1


Table 6.2: This table presents Example 3.4 using potential outcomes. For each patient (or
unit), we observe only one of the two potential outcomes. The observed information is shown
in square brackets. The treatment T is helpful for almost all patients. Only in 2 of 200 cases
does the treatment harm the patient and blind him (B = 1). Although assigning the treatment
(T = 1) is a good idea in most cases, for patient u = 120 it was exactly the wrong decision.


The “fundamental problem of causal inference” prevents us from computing (6.25)
directly. Assume that in a completely randomized experiment, units $u \in U_0 \subset \{1,\dots,n\}$
received treatment T = 0 and units $u \in U_1 = \{1,\dots,n\} \setminus U_0$ received treatment T = 1.
Neyman [1923] shows that

$$\widehat{CE} := \frac{1}{|U_1|} \sum_{u \in U_1} B_u(t = 1) - \frac{1}{|U_0|} \sum_{u \in U_0} B_u(t = 0) \tag{6.26}$$

is an unbiased estimator for (6.25). Here, the randomness in $\widehat{CE}$ comes from the
random assignments that determine which of the unit’s two potential outcomes
we observe; the outcomes themselves are considered hidden, not random. Note
that (6.26) contains only observed quantities and can therefore be computed after
the study has been conducted.
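
The unbiasedness of (6.26) can be checked numerically. The following is a minimal sketch and not part of the original example: the potential outcomes are fixed numbers, only the treatment assignment is random, and the population (12 units helped, 6 unaffected, 2 harmed) as well as the names `average_causal_effect` and `neyman_estimate` are assumptions made for illustration.

```python
import random

def average_causal_effect(b0, b1):
    """Average causal effect (6.25); needs both potential outcomes of every unit."""
    n = len(b0)
    return sum(b1[u] - b0[u] for u in range(n)) / n

def neyman_estimate(b0, b1, treated):
    """Difference in means (6.26); reads only the outcome that would be observed."""
    n = len(b0)
    control = [u for u in range(n) if u not in treated]
    mean_treated = sum(b1[u] for u in treated) / len(treated)
    mean_control = sum(b0[u] for u in control) / len(control)
    return mean_treated - mean_control

# Hypothetical fixed population of 20 units (1 = blind, 0 = cured).
b0 = [1] * 12 + [0] * 6 + [0] * 2   # outcome without treatment
b1 = [0] * 12 + [0] * 6 + [1] * 2   # outcome with treatment

rng = random.Random(0)
units = list(range(len(b0)))
estimates = []
for _ in range(20000):
    treated = set(rng.sample(units, 10))  # completely randomized assignment
    estimates.append(neyman_estimate(b0, b1, treated))

true_ce = average_causal_effect(b0, b1)
mean_estimate = sum(estimates) / len(estimates)
print(true_ce, mean_estimate)
```

A single estimate can deviate from the average causal effect, but the mean over many randomizations should be close to it, which is what unbiasedness asserts.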

There is an extensive debate about which of the two approaches is better suited 
for practical applications [see, e.g., Pearl, 1995, Imbens and Rubin, 1995, Rubin,
2004, Lauritzen, 2004]. We do not plan to take an active part in this discussion
but rather mention the following three results: (1) We describe how to represent 
potential outcomes as counterfactuals [Pearl, 2009, Section 3.6.3]; (2) there is a 
logical equivalence between both frameworks [Galles and Pearl, 1998, Halpern, 
2000]; and (3), we comment on a recently proposed framework [Richardson and 
Robins, 2013] that brings both worlds closer together. 


6.9.2 Relation between Potential Outcomes and SCMs 


In SCMs, we can represent potential outcomes using the language of counterfactuals
(Section 6.4). In the eye doctor example, the SCM $\mathfrak{C}$ satisfies $T = N_T$ and
$B = T \cdot N_B + (1 - T) \cdot (1 - N_B)$. We can therefore represent each patient by specific
values for $N_B$ and $N_T$. In Table 6.2, for example, patient 43 is characterized by
$N_T = 1, N_B = 0$, while patient 44 satisfies $N_T = 0, N_B = 1$. The two terms t = 0 and
t = 1 then correspond to interventions on T. Summarizing, we have that

$$\underbrace{B_u(t = \tilde{t})}_{\text{potential outcome}} = B \text{ in the SCM } \underbrace{\mathfrak{C} \mid N = n_u;\ do(T := \tilde{t})}_{\text{counterfactual SCM}}, \tag{6.27}$$

where $n_u$ characterizes unit u [Pearl, 2009, Equation (3.51)]. Since in the counterfactual
SCM all noise terms are deterministic, the entailed distribution of B is
degenerate, too, and B is deterministic (as required). In the example shown in
Table 6.2, we have sampled 200 i.i.d. units using Bernoulli distributions
$N_T \sim \mathrm{Ber}(0.6)$ and $N_B \sim \mathrm{Ber}(0.01)$. In this case, SUTVA is satisfied. The i.i.d.
assumption implies that the units do not interfere with each other and modularity
(intervening on T changes only the structural assignment for T) yields that the way
the treatment is taken does not influence the result.
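
The correspondence (6.27) can be made concrete by generating a data set of the kind shown in Table 6.2. This is a minimal sketch under the assumptions stated in the text (noise distributions Ber(0.6) and Ber(0.01), and the structural assignment for B); the seed, the dictionary layout, and the name `potential_outcome` are illustrative choices, so the sampled rows will not reproduce the table.

```python
import random

def potential_outcome(t, n_b):
    """B in the counterfactual SCM with N_B = n_b under do(T := t), cf. (6.27)."""
    return t * n_b + (1 - t) * (1 - n_b)

rng = random.Random(1)
units = []
for u in range(1, 201):
    n_t = 1 if rng.random() < 0.6 else 0    # N_T ~ Ber(0.6)
    n_b = 1 if rng.random() < 0.01 else 0   # N_B ~ Ber(0.01)
    t = n_t                                 # treatment actually received: T = N_T
    b0 = potential_outcome(0, n_b)
    b1 = potential_outcome(1, n_b)
    observed = b1 if t == 1 else b0         # only one potential outcome is observed
    units.append({"u": u, "T": t, "B0": b0, "B1": b1, "observed": observed})

for row in units[:3]:
    print(row)
```

Each unit is fully described by its noise values, and both potential outcomes are deterministic functions of them; the data analyst, however, only sees the columns `T` and `observed`.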

We now discuss a result that shows in what sense both representations in (6.27) 
are equivalent. For this, we mainly follow the presentation in Pearl [2009, 7.3.1] 
and Halpern [2000]. The main argumentation is based on the following steps: 

1. Define the properties (axioms): (C0)-(C5) and (MP) [Halpern, 2000, Section 3].
Property (C4), for example, states that

$$T_u(t = \tilde{t}, w = \tilde{w}) = \tilde{t};$$

it postulates that setting variable T for unit u to $\tilde{t}$ is “effective.”

2. These axioms are satisfied in both representations (“soundness”).

3. It can be shown that these properties are complete for counterfactual SCMs.
Any counterfactual statement follows from one of these axioms.

4. We can conclude that any theorem that holds for counterfactual SCMs holds
in the world of potential outcomes and vice versa.¹⁴ Also, it follows from
step 3 that any data set (like that in Table 6.2) satisfying the three axioms
could be modeled with a counterfactual SCM.¹⁵

The two worlds differ, however, in their language. Even if every theorem holds 
true in both frameworks, some theorems might be “easier” to prove in one world 
than in the other. Similarly, any assumption that appears in a theorem imposes re- 
strictions on the underlying data-generating process; depending on the application, 
one formulation might simplify the assessment of these restrictions. Working with 
settings, in which the average causal effect is zero but the individual causal effects 
are non-zero, seems to be easier for potential outcomes. The graphical representa- 
tion of SCMs, on the other hand, might be beneficial to exploit assumptions on the 
causal relations between random variables. 

Richardson and Robins [2013] propose to use single world intervention graphs. 
These graphs allow us to set variables to certain values and therefore construct 
graphical correspondences to counterfactual variables. These modified graphs al- 
low us to read off conditional independence statements that involve both factual 
and counterfactual variables. We can therefore see these graphs as a useful tool to 
translate graphical assumptions into counterfactual statements that are often used 
by potential outcomes analysts. 


¹⁴ Strictly speaking, the “vice versa” requires that the potential outcome framework does not
assume more than the axioms mentioned.

¹⁵ If no SCM could possibly generate this data set, this would mean that counterfactuals from
SCMs would satisfy another property not implied by the three axioms, namely the property that this
data set cannot be generated.


6.10 Generalized Structural Causal Models Relating Single Objects


So far, we have studied causal relations among random variables $X_1,\dots,X_d$ and
focused only on a scenario where the data are i.i.d. observations drawn from $P_X$.
We now consider a set $v = \{x_1,\dots,x_d\}$ of nodes of the causal DAG that consists
of any mathematical objects $x_1,\dots,x_d$ formalizing the idea of observations. For
instance, after observing similarities among the texts $x_1,\dots,x_d$ written by different
authors, one may be interested in the causal relation in the sense of which author
has been influenced by which one. Following Steudel et al. [2010], we now describe
in which sense the underlying DAG also entails conditional independence
statements, given an appropriate notion of information, without referring to statistical
sampling. To this end, we assume that we are given some information function


$$R : 2^{v} \to \mathbb{R}_0^{+},$$


which is monotone in the sense that a set of nodes cannot contain more information
than any of its supersets. Then, for any two sets $x, y \subset v$ of nodes, the expression
$R(x,y) - R(y)$ is non-negative and can be interpreted as measuring the conditional
information of x, given y. Moreover, we assume that R is such that for any three
disjoint sets x, y, z of nodes, the expression

$$I(x : y \mid z) := R(x,z) + R(y,z) - R(x,y,z) - R(z) \tag{6.28}$$

is non-negative, which is the case if and only if R is submodular (see Section 9.5.2).
Then, we can interpret (6.28) as generalized conditional mutual information between
x and y, given z because $R(x,z) - R(z)$ measures the information of x, given
z while $R(x,y,z) - R(y,z)$ is the information of x, given y and z. In the same way,
conditional mutual information among random variables can be written as a difference
of Shannon entropies [Cover and Thomas, 1991]. If (6.28) vanishes, we call
x and y conditionally independent, given z.

To define generalized SCMs, one introduces unobserved noise objects $n_j$ for each
observed node $x_j$ and postulates the following statement.


Principle 6.50 (No additional information) A node $x_j$ contains no additional
information on top of the information contained in its parent nodes $pa_j$ and the
unobserved node $n_j$, that is,

$$R(x_j, pa_j, n_j) = R(pa_j, n_j).$$


This generalizes the assumption that every random variable $X_j$ is determined by
its parents and its noise variable, which for discrete random variables amounts to
saying that the Shannon entropy of $X_j, PA_j, N_j$ is the same as the one of $PA_j, N_j$.

The second crucial assumption of an SCM is the statistical independence of noise 
terms. The generalized version of this assumption reads as follows: 


Principle 6.51 (Independence of unobserved objects) The unobserved nodes $n_j$
do not contain information about each other, that is,

$$R(n_1, \dots, n_d) = \sum_{j=1}^{d} R(n_j).$$


Steudel et al. [2010] prove the following theorem.


Theorem 6.52 (Generalized causal Markov condition) If both Principles 6.50
and 6.51 hold, then x and y are conditionally independent, given z, for any three
sets of nodes for which x and y are d-separated by z.


To apply these concepts to the text example, let us consider a text as a collection
of its meaningful words and let its information R be the number of different
words. Assume that the influence among d texts $x_1,\dots,x_d$ is given by the following
simplified mechanism: the author of $x_j$ takes some of the words from the parent
texts of $x_j$ and adds some words from his own ideas. These additional words are
given by $n_j$. Then, Principle 6.50 is satisfied by definition of $n_j$. According to
Principle 6.51, the words added by different authors are assumed to be different.
Two texts are conditionally independent, given a third one, if they only have words
in common that already appear in the latter. The example shows that reasonable
notions of conditional independence can be defined for a much broader class of objects
than random variables. To ensure that the causal Markov condition holds with
respect to that particular notion of independence, the underlying information measure
needs to be appropriate for the respective class of causal mechanisms under
consideration in the sense of Principles 6.50 and 6.51.
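
The text example can be written down in a few lines. The sketch below is an illustration only: texts are represented as Python sets of words, R counts the different words in a union of texts, and the three toy texts together with the function names `information` and `conditional_mutual_information` are assumptions. Here $x_2$ and $x_3$ both borrow words from $x_1$, so $x_1$ d-separates them.

```python
def information(*texts):
    """R: number of different words contained in the given texts taken together."""
    words = set()
    for text in texts:
        words |= text
    return len(words)

def conditional_mutual_information(x, y, z):
    """Generalized conditional mutual information (6.28) for R = word count."""
    return (information(x, z) + information(y, z)
            - information(x, y, z) - information(z))

# Hypothetical texts: x2 and x3 each copy words from x1 and add one new word.
x1 = {"causal", "graph", "noise"}
x2 = {"causal", "graph", "intervention"}
x3 = {"causal", "noise", "counterfactual"}

print(conditional_mutual_information(x2, x3, set()))  # 1: they share "causal"
print(conditional_mutual_information(x2, x3, x1))     # 0: nothing shared beyond x1
```

The two derived texts are dependent when considered alone but conditionally independent given their common source, in line with Theorem 6.52.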

Janzing and Schölkopf [2010] quantify the information between binary strings
using Kolmogorov complexity K with respect to some fixed Turing machine T
(see Section 4.1.9). The function K is approximately submodular up to terms of
O(1), that is, an error that does not grow with the size of the considered strings.
Then, Janzing and Schölkopf [2010] define an “algorithmic model of causality”
where T computes each $x_j$ from its parents and a noise string $n_j$, which ensures
Principle 6.50. Each $n_j$ can also be interpreted as the program that computes $x_j$
from its parents, that is, the mechanism that generates $x_j$ from its direct causes.
Then, Principle 6.51 amounts to the independence of the mechanisms (see Principle 2.1).¹⁶
Applying Theorem 6.52 to R = K yields the “algorithmic Markov
condition” [Janzing and Schölkopf, 2010]: whenever x and y are d-separated by
z, knowing y does not admit a shorter description of x with respect to a Turing
machine that gets z as free background information.
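
Kolmogorov complexity is not computable, so any numerical illustration has to rely on a substitute. The following sketch replaces K by the length of a zlib-compressed string, which is only a crude upper-bound heuristic and not the construction used in the cited work; the function names and the test strings are assumptions made for illustration.

```python
import random
import zlib

def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input; a crude stand-in for K."""
    return len(zlib.compress(data, 9))

def mutual_information_proxy(x: bytes, y: bytes) -> int:
    """Proxy for K(x) + K(y) - K(x, y), with concatenation standing in for (x, y)."""
    return compressed_length(x) + compressed_length(y) - compressed_length(x + y)

rng = random.Random(0)
x = bytes(rng.randrange(256) for _ in range(2000))
y_copy = x                                                   # computed from x
y_unrelated = bytes(rng.randrange(256) for _ in range(2000))  # generated separately

print(mutual_information_proxy(x, y_copy))       # large: the copy adds little
print(mutual_information_proxy(x, y_unrelated))  # small: no shared structure found
```

A compressor can only detect the regularities it was designed for, so a small value of this proxy does not establish algorithmic independence.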

On a higher level, this addresses a deep problem of causal reasoning: the statement
“dependences between observations only occur if they are causally related”
(a generalization of Principle 1.1) only holds if the dependence measure is appropriate
for the class of observations and the class of potential causal mechanisms
under consideration. For instance, after observing that the height of a child has
increased during the past decade, and, at the same time, the value of some stock
has increased, one would not infer them to be causally related because growth is
a property that many time series share without being causally related. Only if two
time series share more sophisticated patterns of different growth (and/or decrease),
do we ask for the common reason behind the similarity. Since non-stationary time
series are ubiquitous, it would be interesting to find information measures for which
we believe dependences to indicate causal relations (after sufficiently accounting
for multiple testing issues if the time series were found by searching over large
databases). Speaking from a more applied machine learning perspective, the problem
leads us to construct appropriate features for which similarities in feature space
indicate causal relations.

¹⁶ This way, the second and the third branch of Figure 2.2 can be seen to coincide. The string $n_j$
encodes the mechanism (i.e., the program running on the Turing machine), and at the same time it is
the analog of the noise term in the statistical setting.


6.11 Algorithmic Independence of Conditionals 


Section 6.10 shows that causal structures not only imply statistical (conditional)
independences, but also independences with respect to other (non-statistical) information
measures. We have further seen that the Markov condition can also be
stated for algorithmic information. Then the most elementary implication of the
algorithmic Markov condition is an analog of Reichenbach’s principle for algorithmic
dependences. Two objects can only be algorithmically dependent when
they have a common cause or when one of them influences the other [Janzing and
Schölkopf, 2010]. This is because they are otherwise d-separated by the empty set
and thus independent. Likewise, d objects $x_1,\dots,x_d$ that are causally unrelated are
jointly algorithmically independent, that is,

$$K(x_1, \dots, x_d) \overset{+}{=} \sum_{j=1}^{d} K(x_j). \tag{6.29}$$

One can also call the difference between the left- and right-hand sides multi-information
(in analogy to the corresponding terminology in statistical information
theory) and write the joint independence as

$$I(x_1, x_2, \dots, x_d) \overset{+}{=} 0. \tag{6.30}$$

Then, joint independence implies also independence of every subset. For instance,
if the joint description of $x_1, x_2$ is shorter than the separate description of $x_1$ and
$x_2$, then the joint description of $x_1,\dots,x_d$ is automatically shorter than the separate
descriptions of all $x_j$ and thus (6.30) implies

$$I(x_1 : x_2) \overset{+}{=} 0.$$


If we assume now that the conditionals¹⁷ $P_{X_j \mid PA_j}$ in a causal graphical model are
“independently chosen by nature,” then we conclude that they are jointly algorithmically
independent [Janzing and Schölkopf, 2010, Lemeire and Janzing, 2013]
and state the multivariate version of Principle 4.13.


Principle 6.53 (Algorithmic independence of conditionals (AIC)) The causal
conditionals described by the Markov kernels in a causal Bayesian network as in
Definition 6.21 (iii) are algorithmically independent, that is,

$$I\big(P_{X_1 \mid PA_1}, P_{X_2 \mid PA_2}, \dots, P_{X_d \mid PA_d}\big) \overset{+}{=} 0, \tag{6.31}$$

or equivalently,

$$K\big(P_{X_1, \dots, X_d}\big) \overset{+}{=} \sum_{j=1}^{d} K\big(P_{X_j \mid PA_j}\big). \tag{6.32}$$


Note that Principle 6.53 must not be confused with the algorithmic Markov con- 
dition discussed in Section 6.10. While the latter refers to causal relations among n 
single objects without referring to statistical sampling, the former still assumes 
the traditional i.i.d. setting with n random variables and only states an additional 
inference principle. 

As for the bivariate case, the equivalence of (6.31) and (6.32) is immediate be- 
cause describing the joint distribution is equivalent to describing all the causal 
Markov kernels. In other words, AIC states that the shortest description of the 
joint distribution is given by separate descriptions of the causal Markov kernels. 

¹⁷ As stated before, we use the notation $P_{Y \mid X}$ as a shorthand for the collection $(P_{Y \mid X=x})_x$ of conditional
distributions.

Causal faithfulness and AIC are related in spirit and often yield similar conclusions.
To discuss similarities and differences, we revisit Example 6.34. Since
the parameter a describes $P_{Y \mid X}$ and the parameters (b,c) describe the conditionals
$P_{Z \mid X,Y}$, we have

$$I(P_{Y \mid X} : P_{Z \mid X,Y}) \overset{+}{\geq} I(a : (b,c)). \tag{6.33}$$

This is because the algorithmic mutual information between two objects cannot be
increased by restricting the attention to some of their “aspects;” see, for example,
Janzing and Schölkopf [2010, Lemma 6]. The “non-generic” independence $X \perp\!\!\!\perp Z$
occurs when the structure coefficients of the linear model satisfy

$$a \cdot b + c = 0. \tag{6.34}$$

Then $K(a \mid b,c) \overset{+}{=} 0$ because a can be computed from b,c via a program of length
O(1). Thus,

$$I(a : (b,c)) \overset{+}{=} K(a) - K(a \mid (b,c)^*) \overset{+}{=} K(a).$$


We conclude that AIC is violated whenever K(a) is significantly larger than 0. 
For a generic real number a, K(a) grows logarithmically with the desired (rela- 
tive) accuracy. Then AIC rejects the corresponding causal DAG because (6.34) is 
considered an unlikely coincidence. 
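
The “non-generic” independence behind (6.34) can be reproduced in a simulation. The sketch below assumes the linear model has the form X := N_X, Y := aX + N_Y, Z := bY + cX + N_Z with standard Gaussian noise, which matches the description above of which parameter belongs to which conditional; the concrete coefficient values and function names are illustrative.

```python
import random

def sample_linear_scm(a, b, c, n, seed=0):
    """Sample X := N_X, Y := a X + N_Y, Z := b Y + c X + N_Z; return X and Z."""
    rng = random.Random(seed)
    xs, zs = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = a * x + rng.gauss(0.0, 1.0)
        z = b * y + c * x + rng.gauss(0.0, 1.0)
        xs.append(x)
        zs.append(z)
    return xs, zs

def correlation(u, v):
    """Empirical Pearson correlation of two equally long samples."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    var_u = sum((ui - mu) ** 2 for ui in u)
    var_v = sum((vi - mv) ** 2 for vi in v)
    return cov / (var_u * var_v) ** 0.5

b, c = 2.0, 3.0
x_tuned, z_tuned = sample_linear_scm(a=-c / b, b=b, c=c, n=50000)   # a*b + c = 0
x_generic, z_generic = sample_linear_scm(a=0.5, b=b, c=c, n=50000)  # a*b + c = 4

print(correlation(x_tuned, z_tuned))      # close to 0: the two paths cancel
print(correlation(x_generic, z_generic))  # clearly non-zero
```

In the tuned case the value of a is determined by (b, c), which is exactly the situation in which knowing (b, c) shortens the description of a.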

We have to explain the phrase “whenever K(a) is significantly larger than 0” because
it amounts to a conceptual difference between AIC and faithfulness. Assume,
for instance, that $b = c$ and $a = -1$. Then (6.34) is satisfied, yet the description of
a does not get shorter when b and c are known because K(a) is already negligible.
Therefore, AIC is not violated, even though (6.34) seems to indicate fine-tuning of
parameters. Following Lemeire and Janzing [2013], we now argue why we consider
not rejecting this kind of tuning as a feature of AIC rather than as a flaw. The
idea is that structure coefficients ±1 (up to some given precision) occur much more
often in nature than some “more generic” value such as 2.36724.... For instance,
spending some money S changes the amount A of available money by $-S$. The
causal relation between S and A is thus described by¹⁸ the structure coefficient $-1$.
Implicitly, AIC and our argument are based on a prior that considers values with
short description length as more likely (in agreement with Solomonoff’s theory of
inductive inference [Solomonoff, 1964]).

Another feature of AIC is that it also rejects the near-cancellation of different
paths: assume, for instance, that a is very close to $-c/b$. To estimate $I(a : (b,c))$
in this case, we observe


$$I(a : (b,c)) \overset{+}{\geq} I(a : -c/b)$$


and use the following idea. The algorithmic mutual information of two integers
n, m that are close to each other is typically about $\log (n/|m - n|)$ because describing
n after m is known requires about $\log |n - m|$ bits, while it requires about $\log n$ bits
otherwise. After arbitrarily fine discretization, we may then represent a and c/b by
integers and take $\log |a/(a + c/b)|$ as a rough estimation for the algorithmic mutual
information between $P_{Y \mid X}$ and $P_{Z \mid X,Y}$.

¹⁸ The example suggests that structure coefficients being simple is often a result of how we define
variables rather than being a property of “nature.” In general, one may wonder to what extent we
define variables in a way that yields simple causal relations.


6.12 Problems 


Problem 6.54 (DAGs) Table B.1 on page 223 states that for three nodes there are 
25 DAGs. Why is this the case? 


Problem 6.55 (Multivariate SCMs) Consider the following SCM $\mathfrak{C}$

$$
\begin{aligned}
V &:= N_V \\
W &:= -2V + 3Y + 5Z + N_W \\
X &:= 2V + N_X \\
Y &:= -X + N_Y \\
Z &:= \alpha X + N_Z
\end{aligned}
$$

with $N_V, N_W, N_X, N_Y, N_Z \overset{\text{iid}}{\sim} \mathcal{N}(0,1)$.


a) Draw the graph corresponding to the SCM.


b) Set $\alpha = 2$ and simulate 200 i.i.d. data points from the joint distribution; plot
the values of X and W to visualize the distribution $P_{X,W}$.


c) Again, set $\alpha = 2$ and sample 200 i.i.d. data points from the intervention
distribution $P_{X,W}^{\mathfrak{C};\, do(X := 3)}$, in which we have intervened on X. Again, plot the
sample and compare with the plot from part b.


d) A directed path from one node to another does not necessarily imply that the
former node has a causal effect on the latter. Choose a value of $\alpha$ and prove
that for this value X has no causal effect on W.


e) For any given $\alpha$, compute

$$\frac{\partial}{\partial x} E^{\mathfrak{C};\, do(X := x)}[W].$$


Problem 6.56 (Interventions) Consider the SCM


$$
\begin{aligned}
X &:= N_X \\
Y &:= (X - 4)^2 + N_Y \\
Z &:= X^2 + Y^2 + N_Z
\end{aligned}
$$

with $N_X, N_Y, N_Z \overset{\text{iid}}{\sim} \mathcal{N}(0,1)$. You may intervene on either X or Y. Which hard
intervention yields the smallest expected value of Z?


Problem 6.57 (Minimality) We have stated in Remark 6.6 that causal minimality 
(Definition 6.33) implies structural minimality. 


a) Convince yourself that this is shown by Proposition 6.49. 


b) Provide an example of an SCM that satisfies structural minimality but vio- 
lates causal minimality. 


Problem 6.58 (Causal Minimality) Consider a causal graphical model with a 
distribution that has a strictly positive, continuous density and for which causal 
minimality is violated. According to Proposition 6.36, we can then remove an 
“inactive” edge from the graph and obtain a new causal graphical model. Prove 
that the two models are interventionally equivalent. 


Problem 6.59 (Interventional equivalence) Consider two SCMs $\mathfrak{C}_1$ and $\mathfrak{C}_2$ of
the form

$$
\begin{aligned}
X &:= N_X \\
Y &:= X + N_Y \\
Z &:= f_i(X,Y) + N_Z
\end{aligned}
$$

with $N_X, N_Y, N_Z \overset{\text{iid}}{\sim} \mathcal{U}(-1,1)$, a continuous uniform distribution between −1 and 1.
Choose the functions $f_1$ and $f_2$ such that $\mathfrak{C}_1$ and $\mathfrak{C}_2$ are observationally equivalent,
and agree on all single node interventions, but disagree on simultaneous interventions
on several nodes. This problem shows that Proposition 6.48 does not need to
be true if the density is not strictly positive.


Problem 6.60 (Cyclic SCMs) Prove that whenever the absolute values of the 
eigenvalues of a square matrix B are strictly smaller than 1 (i.e., the spectral radius 
of B is strictly smaller than 1), then $I - B$ is invertible.


Problem 6.61 (Cyclic SCMs) Consider the assignment $X := BX + N$, as described
in Remark 6.5. Prove that if the spectral radius of B is strictly smaller than
1, then $X^t$ defined by $X^t := BX^{t-1} + N$ in Equation (6.3) converges in distribution
to $X := (I - B)^{-1}N$ as defined in Equation (6.2).


Problem 6.62 (d-separation) Prove that one can d-separate any two nodes in 
a DAG G that are not directly connected by an edge. Use this statement to prove 
Proposition 6.35. 


Problem 6.63 (Covariate adjustment) Assume that Z is a valid adjustment set
for the causal effect from X to Y and that (Y,X,Z) has a (zero mean) Gaussian
distribution with

$$E[Y \mid X = x, Z = z] = ax + bz.$$

Prove that

$$\frac{\partial}{\partial x} E^{\mathfrak{C};\, do(X := x)}[Y] = a;$$

in other words, prove Equation (6.20) using Equations (6.19) and (6.13). This
result allows us to consistently estimate the causal effect a by regressing Y on X
and Z.


Problem 6.64 (Covariate adjustment) Prove the parent adjustment and the back- 
door criterion Proposition 6.41 (i) and (ii) using Equation (6.17). 


Problem 6.65 (Covariate adjustment) Prove the frontdoor criterion (6.23) starting
with

$$p^{\mathfrak{C};\, do(X := x)}(y) = \sum_{z} p^{\mathfrak{C};\, do(X := x)}(y \mid z)\, p^{\mathfrak{C};\, do(X := x)}(z)$$

and then using rules 2 and 3 from do-calculus (Section 6.7).


7 


Learning Multivariate Causal Models 


As in Chapter 4, we now turn to the problem of learning causal models. We first 
discuss different assumptions under which (parts of) the graph structure can be re- 
covered from the joint distribution in Section 7.1 (“structure identifiability”). Some 
of these results carry over from the bivariate setting discussed earlier. As in the bi- 
variate case, there is no complete characterization of identifiability assumptions, 
and future research may reveal promising alternatives. In Section 7.2, we then 
introduce methods and algorithms, such as independence-based and score-based 
methods, that estimate the graph from a finite data set (“structure identification”). 

As in the bivariate setting, we are again facing the problem that the class of SCMs
is too flexible. Given a distribution $P_X$ over random variables $X = (X_1,\dots,X_d)$, can
different SCMs entail this distribution? This question is answered by the following
proposition: indeed, usually for many different graph structures, there is an SCM
that induces the distribution $P_X$.¹


Proposition 7.1 (Non-uniqueness of graph structures) Consider a random vector
$X = (X_1,\dots,X_d)$ with distribution $P_X$ that has a density with respect to Lebesgue
measure and assume it is Markovian with respect to $\mathcal{G}$. Then there exists an SCM
$\mathfrak{C} = (S, P_N)$ with graph $\mathcal{G}$ that entails the distribution $P_X$.


Proof. See Appendix C.9. 


¹ Statements similar to Proposition 7.1 can be found in Druzdzel and Simon [1993] and Druzdzel
and van Leijen [2001].

In particular, given any complete DAG, we can find a corresponding SCM that
entails the distribution at hand. As in the bivariate case, it is therefore apparent
that we require further assumptions to obtain identifiability results. The following
section discusses some of those assumptions.
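
For two variables, the non-uniqueness stated in Proposition 7.1 can be seen in a small simulation. The sketch below is an illustration under an assumed setting that is not taken from the text: a zero-mean bivariate Gaussian with unit variances and correlation 0.7, generated once by an SCM with graph X → Y and once by an SCM with graph Y → X. Function names and the sample size are arbitrary choices.

```python
import random

def sample_forward(rho, n, rng):
    """SCM with graph X -> Y: X := N_X, Y := rho * X + sqrt(1 - rho^2) * N_Y."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + (1 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    return pairs

def sample_backward(rho, n, rng):
    """SCM with graph Y -> X: Y := N_Y, X := rho * Y + sqrt(1 - rho^2) * N_X."""
    pairs = []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)
        x = rho * y + (1 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    return pairs

def second_moments(pairs):
    """Empirical E[X^2], E[Y^2], E[XY]; these determine a zero-mean Gaussian."""
    n = len(pairs)
    return (sum(x * x for x, _ in pairs) / n,
            sum(y * y for _, y in pairs) / n,
            sum(x * y for x, y in pairs) / n)

rng = random.Random(0)
forward = second_moments(sample_forward(0.7, 50000, rng))
backward = second_moments(sample_backward(0.7, 50000, rng))
print(forward)
print(backward)
```

Both samples have approximately the same second moments, so the observational distribution alone cannot tell the two graphs apart in this linear Gaussian setting.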


7.1 Structure Identifiability 


7.1.1 Faithfulness 


If the distribution $P_X$ is Markovian and faithful with respect to the underlying DAG
$\mathcal{G}^0$, we have a one-to-one correspondence between d-separation statements in the
graph $\mathcal{G}^0$ and the corresponding conditional independence statements in the distribution.
All graphs outside the correct Markov equivalence class of $\mathcal{G}^0$ can therefore
be rejected because they impose a set of d-separations that does not equal the set
of conditional independences in $P_X$. Since both the Markov condition and faithfulness
put restrictions only on the conditional independences in the joint distribution,
it is also clear that we are not able to distinguish between two Markov equivalent
graphs, that is, between two graphs that entail exactly the same set of conditional
independences (see for example Figure 6.4 on page 103). Summarizing, under
the Markov condition and faithfulness, the Markov equivalence class of $\mathcal{G}^0$, represented
by $\mathrm{CPDAG}(\mathcal{G}^0)$, is identifiable from $P_X$ [e.g., Spirtes et al., 2000].
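
Whether two DAGs over the same nodes are Markov equivalent can be decided graphically: by the standard criterion of Verma and Pearl, they are equivalent exactly when they share the same skeleton and the same v-structures. The sketch below implements this check; the edge-set representation and the function names are assumptions for illustration, and isolated nodes are ignored.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of a set of directed edges (parent, child)."""
    return {frozenset(edge) for edge in edges}

def v_structures(edges):
    """All triples a -> c <- b in which a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, pa in parents.items():
        for a, b in combinations(sorted(pa), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, child, b))
    return found

def markov_equivalent(edges_1, edges_2):
    """Same skeleton and same v-structures (graphical criterion)."""
    return (skeleton(edges_1) == skeleton(edges_2)
            and v_structures(edges_1) == v_structures(edges_2))

chain = {("X", "Y"), ("Y", "Z")}      # X -> Y -> Z
fork = {("Y", "X"), ("Y", "Z")}       # X <- Y -> Z
collider = {("X", "Y"), ("Z", "Y")}   # X -> Y <- Z

print(markov_equivalent(chain, fork))      # True
print(markov_equivalent(chain, collider))  # False
```

The chain and the fork cannot be distinguished under the Markov condition and faithfulness, whereas the collider forms its own equivalence class.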


Lemma 7.2 (Identifiability of Markov equivalence class) Assume that $P_X$ is
Markovian and faithful with respect to $\mathcal{G}^0$. Then, for each graph $\mathcal{G} \in \mathrm{CPDAG}(\mathcal{G}^0)$,
we find an SCM that entails the distribution $P_X$. Furthermore, there is no graph $\mathcal{G}$
with $\mathcal{G} \notin \mathrm{CPDAG}(\mathcal{G}^0)$, such that $P_X$ is Markovian and faithful with respect to $\mathcal{G}$.


Proof. The first statement is a direct implication from Proposition 7.1, and the
second statement follows from the definition of Markov equivalence; see Definition 6.24.


Independence-based methods (also called constraint-based methods) assume that 
the distribution is Markovian and faithful with respect to the underlying graph and 
then estimate the correct Markov equivalence class; see Section 7.2.1. 

We have seen in Example 6.42 that for Gaussian distributions the causal effect 
can be summarized by a single number (6.20). If instead of the correct graph, 
we only know the Markov equivalence class of that graph, this quantity is not 
identifiable anymore. It is possible, however, to provide bounds [Maathuis et al., 
2009].


7.1.2 Additive Noise Models


Proposition 7.1 shows that a given distribution could have been entailed by several
SCMs with different graphs. For many of these graph structures, however, the
functions $f_j$ appearing in the structural assignments are rather complicated. It turns
out that we obtain non-trivial identifiability results if we do not allow for arbitrarily
complex functions, that is, if we restrict the function class. As in Chapter 4,
we will assume in the following Sections 7.1.4 and 7.1.5 that the
noise acts in an additive way.


Definition 7.3 (ANMs) We call an SCM $\mathfrak{C}$ an ANM if the structural assignments
are of the form

$$X_j := f_j(PA_j) + N_j, \quad j = 1,\dots,d, \tag{7.1}$$

that is, if the noise is additive. For simplicity, we further assume that the functions
$f_j$ are differentiable and the noise variables $N_j$ have a strictly positive density.²
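
Sampling from an ANM amounts to visiting the nodes in a causal order and adding fresh noise to a function of the parents. The following sketch uses a hypothetical three-variable ANM with Gaussian noise; the graph, the functions (tanh, square, sine), and the name `sample_anm` are assumptions chosen only so that the functions are differentiable and the noise has a strictly positive density.

```python
import math
import random

def sample_anm(n, seed=0):
    """Draw n samples from a hypothetical three-variable ANM with Gaussian noise."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Visit the nodes in a causal (topological) order: X1, X2, X3.
        x1 = rng.gauss(0.0, 1.0)                            # X1 := N1
        x2 = math.tanh(x1) + rng.gauss(0.0, 1.0)            # X2 := f2(X1) + N2
        x3 = x1 * x1 + math.sin(x2) + rng.gauss(0.0, 1.0)   # X3 := f3(X1, X2) + N3
        samples.append((x1, x2, x3))
    return samples

data = sample_anm(1000)
print(data[0])
```

An intervention on one node would replace only the corresponding line of the loop and leave the other assignments unchanged.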


Some of the following identifiability results assume causal minimality (Definition 6.33).
For ANMs, this means that each function $f_j$ is not constant in any of its
arguments. Intuitively, the function should really “depend” on its arguments. The
proof of the following proposition is provided in Appendix C.10.


Proposition 7.4 (Causal minimality and ANMs) Consider a distribution induced
by a model (7.1) and assume that the functions $f_j$ are not constant in any
of their arguments, that is, for all $j$ and $i \in PA_j$ there is some value $pa_{j,-i}$ of the
variables $PA_j \setminus \{i\}$ and some $x_i \neq x_i'$ such that

$$f_j(pa_{j,-i}, x_i) \neq f_j(pa_{j,-i}, x_i').$$

Then the joint distribution satisfies causal minimality with respect to the corresponding
graph. Conversely, if there are nodes j and i such that for all $pa_{j,-i}$ the
function $f_j(pa_{j,-i}, \cdot)$ is constant, causal minimality is violated.


We have argued in Remark 6.6 that we can restrict ourselves to functions that are 
not constant in one of their arguments; see Proposition 6.49. We have now seen that 
for ANMs with fully supported noise, this restriction implies causal minimality. 

Given the restricted class of SCMs described in (7.1), do we obtain full structure
identifiability? Again, the answer is negative. Theorem 4.2 and Problem 7.13
show that if the distribution is induced by a linear Gaussian SCM, for example,
we cannot necessarily recover the correct graph. It turns out, however, that this
case is exceptional in the following sense. For almost all other combinations of
functions and distributions, we obtain identifiability. All the nonidentifiable cases
have been characterized [Zhang and Hyvärinen, 2009, Peters et al., 2014]. Another
non-identifiable example different from the linear Gaussian case is shown in the
right plot in Figure 4.2. Its details can be found in Peters et al. [2014, Example 25].
Table 7.1 shows some of the known identifiability results.

² These two conditions guarantee that the joint distribution over $X_1,\dots,X_d$ allows for a strictly
positive density, for example.


Type of structural assignment                                                | Condition on funct. | DAG identif. | See
-----------------------------------------------------------------------------|---------------------|--------------|--------------
(General) SCM: $X_j := f_j(X_{PA_j}, N_j)$                                   | -                   | no           | Prop. 7.1
ANM: $X_j := f_j(X_{PA_j}) + N_j$                                            | nonlinear           | yes          | Thm. 7.7(i)
CAM: $X_j := \sum_{k \in PA_j} f_{jk}(X_k) + N_j$                            | nonlinear           | yes          | Thm. 7.7(ii)
Linear Gaussian: $X_j := \sum_{k \in PA_j} \beta_{jk} X_k + N_j$             | linear              | no           | Problem 7.13
Lin. G., eq. error var.: $X_j := \sum_{k \in PA_j} \beta_{jk} X_k + N_j$     | linear              | yes          | Prop. 7.5


Table 7.1: Summary of some known identifiability results for Gaussian noise. Results for
non-Gaussian noise are available, too, but they are more technical.

Let us mention again that there are several extensions to the framework of ANMs.
For example, Zhang and Hyvärinen [2009] allow for a post-nonlinear transformation
of the variables and Peters et al. [2011a] consider ANMs for discrete variables.

In general, nonlinear ANMs are not closed under marginalization. That is, if
$P_{X,Y,Z}$ allows for ANMs from $X$ to $Y$ and from $Y$ to $Z$, $P_{X,Z}$ does not necessarily
allow for an ANM from $X$ to $Z$. This may restrict the applicability of ANMs
in practice, since one may not observe intermediate variables on a causal path. 
For experiments in physics, one could argue that every influence is propagated via 
infinitely many intermediate variables. Thus, there is no absolute notion of direct 
or indirect effect (instead, it must always be relative to the observed set). In this 
sense, ANMs can only be taken as good approximations. 

In the following three subsections, we will look at three specific identifiable examples
in more detail: the linear Gaussian case with equal error variances (Section 7.1.3),
the linear non-Gaussian case (Section 7.1.4), and the nonlinear Gaussian
case (Section 7.1.5). Although more general results are available [Peters et al.,
2014], we concentrate on these three examples because for them precise conditions
can be stated easily. We omit proofs and concentrate on the statements. Most of
the proofs can be based on the techniques developed in Peters et al. [2011b]. They 
allow many of the bivariate identifiability results that we developed in Chapter 4 to 
carry over to the multivariate setting. 
7.1.3 Linear Gaussian Models with Equal Error Variances


There is another deviation from linear Gaussian SEMs that makes the graph iden- 
tifiable. Peters and Bühlmann [2014] show that restricting the noise variables to 
have the same variance is sufficient to recover the graph structure. The proof can 
be found in Peters and Bühlmann [2014]. 


Proposition 7.5 (Identifiability with equal error variances) Consider an SCM
with graph $\mathcal{G}_0$ and assignments

$$X_j := \sum_{k \in \mathbf{PA}_j^{\mathcal{G}_0}} \beta_{jk} X_k + N_j, \qquad j = 1,\dots,d,$$

where all $N_j$ are i.i.d. and follow a Gaussian distribution. In particular, the noise
variance $\sigma^2$ does not depend on $j$. Additionally, for each $j \in \{1,\dots,d\}$ we require
$\beta_{jk} \neq 0$ for all $k \in \mathbf{PA}_j^{\mathcal{G}_0}$. Then, the graph $\mathcal{G}_0$ is identifiable from the joint
distribution.


For estimating the coefficients $\beta_{jk}$ (and therefore the graph structure) Peters and
Bühlmann [2014] propose to use a penalized maximum likelihood score based
on the Bayesian information criterion (BIC); see also Section 7.2.2, and a greedy 
search algorithm in the space of DAGs. Rescaling the variables changes the vari- 
ance of the error terms. Therefore, in many applications, model (7.2) cannot be 
sensibly applied. The BIC, however, allows us to compare the method’s score with 
the score of a linear Gaussian SCM that uses more parameters and does not make 
the assumption of equal error variances. 
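
To make the role of the equal-variance assumption concrete, consider the bivariate case $X_1 := N_1$, $X_2 := \beta X_1 + N_2$ with common noise variance $\sigma^2$. The following minimal sketch (the function name `backward_model_variances` is ours and purely illustrative, not taken from Peters and Bühlmann [2014]) computes, at the population level, the two noise variances that a linear Gaussian model in the reverse direction would need in order to reproduce the same covariance matrix. Unless $\beta = 0$, these two variances differ, so the reverse model falls outside the equal-variance class.

```python
def backward_model_variances(beta, sigma2=1.0):
    """Population quantities for X1 := N1, X2 := beta * X1 + N2,
    where Var(N1) = Var(N2) = sigma2.

    Returns (var_source, var_residual) of the reverse linear model
    X2 := M2, X1 := b * X2 + M1 that induces the same covariance matrix.
    """
    var_x1 = sigma2
    cov_12 = beta * sigma2
    var_x2 = (beta ** 2 + 1.0) * sigma2
    b = cov_12 / var_x2                 # regression coefficient of X1 on X2
    var_residual = var_x1 - b * cov_12  # Var(X1 - b * X2)
    return var_x2, var_residual


# Illustrative values (our choice): beta = 2, sigma^2 = 1.
var_source, var_residual = backward_model_variances(2.0)
print(var_source, var_residual)  # the two variances are clearly unequal
```

The same computation also shows why rescaling matters: multiplying $X_2$ by a constant changes its noise variance, so the equal-variance property is not preserved under a change of units.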


7.1.4 Linear Non-Gaussian Acyclic Models 


Shimizu et al. [2006] prove the following statement using independent compo- 
nent analysis (ICA) [Comon, 1994, Theorem 11], which itself is proved using the 
Darmois-Skitovič theorem.


Theorem 7.6 (Identifiability of LiNGAMs) Consider an SCM with graph $\mathcal{G}_0$ and
assignments

$$X_j := \sum_{k \in \mathbf{PA}_j^{\mathcal{G}_0}} \beta_{jk} X_k + N_j, \qquad j = 1,\dots,d, \qquad (7.2)$$

where all $N_j$ are jointly independent and non-Gaussian distributed with strictly
positive density.³ Additionally, for each $j \in \{1,\dots,d\}$, we require $\beta_{jk} \neq 0$ for all
$k \in \mathbf{PA}_j^{\mathcal{G}_0}$. Then, the graph $\mathcal{G}_0$ is identifiable from the joint distribution.

The authors call this model a LiNGAM. As mentioned in Section 4.1.3, there is
an alternative proof for Theorem 7.6: Theorem 28 in Peters et al. [2014] extends 
bivariate identifiability results such as Theorem 4.2 to the multivariate case. This 
trick is also used for nonlinear additive models (by extending Theorem 4.5). 


7.1.5 Nonlinear Gaussian Additive Noise Models 


We have seen that the graph structure of an ANM becomes identifiable if the as- 
signments are linear and the noise variables are non-Gaussian. Alternatively, we 
can also exploit nonlinearity. The result is easiest to state with Gaussian noise: 


Theorem 7.7 (Identifiability of nonlinear Gaussian ANMs)
(i) Let $P_{\mathbf{X}} = P_{X_1,\dots,X_d}$ be induced by an SCM with

$$X_j := f_j(\mathbf{PA}_j) + N_j,$$

with normally distributed noise variables $N_j \sim \mathcal{N}(0, \sigma_j^2)$ and three times
differentiable functions $f_j$ that are not linear in any component in the following
sense. Denote the parents $\mathbf{PA}_j$ of $X_j$ by $X_{k_1},\dots,X_{k_\ell}$; then the function
$f_j(x_{k_1},\dots,x_{k_{a-1}}, \cdot, x_{k_{a+1}},\dots,x_{k_\ell})$ is assumed to be nonlinear for all $a$ and some
$x_{k_1},\dots,x_{k_{a-1}},x_{k_{a+1}},\dots,x_{k_\ell} \in \mathbb{R}^{\ell-1}$.

(ii) As a special case, let $P_{\mathbf{X}} = P_{X_1,\dots,X_d}$ be induced by an SCM with

$$X_j := \sum_{k \in \mathbf{PA}_j} f_{jk}(X_k) + N_j, \qquad (7.3)$$

with normally distributed noise variables $N_j \sim \mathcal{N}(0, \sigma_j^2)$ and three times
differentiable, nonlinear functions $f_{jk}$. This model is known as a causal
additive model (CAM).

In both cases (i) and (ii), we can identify the corresponding graph $\mathcal{G}_0$ from the
distribution $P_{\mathbf{X}}$. The statements remain true if the noise distributions for source
nodes, that is, nodes without parents, are allowed to have a non-Gaussian density
with full support on the real line $\mathbb{R}$ (the proof remains identical).


The proof can be found in Peters et al. [2014, Corollary 31]. 


³ The condition of a strictly positive density can be weakened (see details of the proof of ICA),
but it is certainly necessary to assume that the noise variables are non-degenerate, for example.
7.1.6 Observational and Experimental Data


We have already seen in Section 6.3 that knowing causal relations can help improve 
predictions when the underlying distribution changes. We will now turn this idea 
around and show how observing the system in different environments can be used 
to learn causal relations. We therefore turn to the following setup, in which we
observe data from different environments $e \in \mathcal{E}$. The corresponding model reads

$$\mathbf{X}^e = (X_1^e, \dots, X_d^e) \sim P^e,$$

where each variable $X_j^e$ denotes the same (physical) quantity, measured in environment
$e \in \mathcal{E}$. We will talk about a variable $X_j$ in different environments, which is a
slight abuse of notation.


Known Intervention Targets A first type of method assumes that the different
environments stem from different interventional settings. In the case that the
intervention targets $\mathcal{I}^e \subseteq \{1,\dots,d\}$ are known, several methods have been proposed.
Tian and Pearl [2001] and Hauser and Bühlmann [2012], for example,
assume faithfulness and consider mechanism changes and stochastic interventions, 
respectively. They define and characterize the interventional equivalence classes 
of graphs: that is, the class of graphs that can explain the given distributions. For 
mechanism changes, for example, we can include an intervention node into the 
model whose children are the variables that are intervened on. This way we in- 
crease the number of v-structures and two graphs become intervention equivalent 
(with respect to the given distributions) if they have the same skeletons and v- 
structures, and the nodes that are intervened on have the same parents [cf. Tian 
and Pearl, 2001, Theorem 2]. Eberhardt et al. [2010] allow for hard and stochastic 
interventions even in the presence of cycles. 

Hyttinen et al. [2012] analyze conditions on the interventions under which the 
graph becomes identifiable. Eberhardt et al. [2005] and Hauser and Bühlmann 
[2014] investigate how many intervention experiments are necessary in the worst 
case to identify the graph. 


Different Environments Let us now turn to a slightly different setting, in which 
we do not try to learn the whole causal structure. Instead, we consider a target 
variable $Y$ with a set of $d$ predictors $\mathbf{X}$ and try to learn which of the predictors are
the causal parents of $Y$. Both $\mathbf{X}$ and $Y$ are observed in different environments $e \in \mathcal{E}$
(which could be intervention settings with unknown targets). That is, we have

$$(\mathbf{X}^e, Y^e) \sim P_{\mathbf{X}^e, Y^e} = P^e$$

for $e \in \mathcal{E}$. The key assumption is the existence of an unknown set $\mathbf{PA}_Y \subseteq \{1,\dots,d\}$
(one may think of the direct causes of $Y$) such that the conditional $Y$ given $\mathbf{PA}_Y$ is
invariant over all environments, that is, for all $e, f \in \mathcal{E}$ we have

$$P_{Y^e \mid \mathbf{PA}_Y^e} = P_{Y^f \mid \mathbf{PA}_Y^f}.$$


This assumption is satisfied if the distributions are induced by an underlying SCM 
and the different environments correspond to different intervention distributions, 
for which Y has not been intervened on [Peters et al., 2016] (see Code Snippet 7.11 
for an example). Having said that, the setting is more general and the environments 
do not need to correspond to interventions; one does not even require an underlying 
SCM. One can consider the collection $\mathcal{S}$ of all sets $S \subseteq \{1,\dots,d\}$ of variables that
lead to “invariant prediction,” that is, for all $e, f \in \mathcal{E}$ and for all $S \in \mathcal{S}$, we have

$$P_{Y^e \mid S^e} = P_{Y^f \mid S^f}. \qquad (7.4)$$

Here, $Y^e \mid S^e$ is shorthand notation for $Y^e \mid \mathbf{X}_S^e$. It is not difficult to see (Problem 7.15)
that the variables appearing in all those sets $S \in \mathcal{S}$ must be direct causes of $Y$:

$$\bigcap_{S \in \mathcal{S}} S \subseteq \mathbf{PA}_Y, \qquad (7.5)$$


where we define the intersection over an empty index set as the empty set. Peters
et al. [2016] consider the left-hand side of (7.5) as an estimate for $\mathbf{PA}_Y$. Inclusion (7.5) then
guarantees that any variable contained in the output of this method is indeed in
$\mathbf{PA}_Y$. In the special case of SCMs and interventions, there are sufficient conditions
[Peters et al., 2016] under which $\mathbf{PA}_Y$ becomes identifiable, in other words, (7.5) is
an equality. Interestingly, the method we present in Section 7.2.5 detects whether
the data come from such an identifiable case; it does not need to assume it.
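
The left-hand side of (7.5) is easy to compute once the collection of invariant sets is available. The following minimal sketch assumes that this collection has already been obtained by some procedure that checks (7.4), which is not shown here; the function name and the example sets are hypothetical. It follows the convention stated above that an empty collection yields the empty set.

```python
def intersect_invariant_sets(invariant_sets):
    """Intersection of all sets S in the collection.

    `invariant_sets` is an iterable of sets of predictor indices for which
    invariance (7.4) was accepted. By convention, an empty collection yields
    the empty set.
    """
    sets = [frozenset(s) for s in invariant_sets]
    if not sets:
        return frozenset()
    return frozenset.intersection(*sets)


# Hypothetical collection of accepted sets (illustration only).
accepted = [{1, 2}, {1, 2, 3}, {1, 4}]
print(sorted(intersect_invariant_sets(accepted)))
```

Because every additional accepted set can only shrink the intersection, the output errs on the side of reporting too few variables rather than false positives, in line with the inclusion (7.5).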

Tian and Pearl [2001] also address the question of identifiability with unknown 
intervention targets. They do not specify a target variable and focus on changes in 
marginal distributions rather than conditionals. 


7.2 Methods for Structure Identification 


We have seen several assumptions that lead to (partial) identifiability of the causal 
structure. The purpose of this section is to show how these assumptions can be 
exploited to provide estimators of the underlying graph from a finite amount of 
data (see Figure 7.1 for two examples). We provide an overview of methods and 
try to focus on their ideas. There is a large pool of methods, and we believe that
future research needs to show which of these methods will prove to be most useful 
in practice. We nevertheless try to highlight some of the methods’ potential prob- 
lems and most crucial assumptions. Although some papers study the consistency 
of the presented methodology, we omit most of those results and present ideas 
only. Subtleties of algorithmic implementation will not be discussed either, and we 
would like to refer the interested reader to the references we provide. Kalisch et al. 
[2012] maintain the software package pcalg for R [R Core Team, 2016] that con- 
tains code not only for the PC (for the inventors Peter Spirtes and Clark Glymour) 
algorithm (see Section 7.2.1), but also for many of the described methods. 

Before providing more details about the existing methodology, we would like to 
add two comments first: (1) While there are several simulation studies available, a 
topic that receives little attention is the question of a loss function. Given the true 
underlying causal structure, how “good” is an estimated causal graph? In practice, 
one often uses variants of the structural Hamming distance [Acid and de Campos, 
2003, Tsamardinos et al., 2006], which counts the number of misspecified edges. 
As an alternative, Peters and Bühlmann [2015] suggest evaluating the graph based
on its ability to predict intervention distributions. (2) Some of the methods that we
present assume that the structural assignments (6.1) and the corresponding functions
$f_j$ in particular are simple. Often, those methods do provide estimates not
only for the causal structure but also for the corresponding assignments, which can 
usually be used to compute residuals, too. In principle, and under this model, we 
can then test the strong assumption of mutually independent noise variables (Defi- 
nition 3.1), for example, by applying a mutual independence test [e.g., Pfister et al., 
2017]; see Section 4.2.1 for statistical subtleties of such a procedure. 
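
As an illustration of the first comment, here is a minimal sketch of one common variant of the structural Hamming distance. It assumes that graphs are given as sets of (parent, child) pairs and that a missing, an extra, and a reversed edge each count as one error; other conventions (for example, counting a reversal twice) exist, and the function name is ours.

```python
def structural_hamming_distance(edges_true, edges_est):
    """Count node pairs whose edge status differs between two directed graphs.

    Each graph is a set of (parent, child) tuples. A missing edge, an extra
    edge, and a reversed edge each contribute 1 (one of several conventions).
    """
    def status(edges):
        # Map each unordered node pair to the set of orientations present.
        out = {}
        for a, b in edges:
            out.setdefault(frozenset((a, b)), set()).add((a, b))
        return out

    s_true, s_est = status(edges_true), status(edges_est)
    pairs = set(s_true) | set(s_est)
    return sum(1 for p in pairs if s_true.get(p, set()) != s_est.get(p, set()))


true_graph = {("X", "Z"), ("Z", "Y")}
estimate = {("Z", "X"), ("Z", "Y"), ("X", "Y")}  # one reversal, one extra edge
print(structural_hamming_distance(true_graph, estimate))
```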


7.2.1 Independence-Based Methods 


Independence-based methods such as the inductive causation (IC) algorithm, the 
SGS (for the inventors Spirtes, Glymour, and Scheines) algorithm, and the PC 
algorithm assume that the distribution is faithful to the underlying DAG. This ren- 
ders the Markov equivalence class, that is, the corresponding CPDAG, identifiable 
(see Section 7.1.1). There is a one-to-one correspondence between d-separations 
in the graph and conditional independences in $P_{\mathbf{X}}$. Any query of a d-separation
statement can therefore be answered by checking the corresponding conditional 
independence test. We first assume that an oracle provides us with the correct an- 
swers to the conditional independence questions and discuss some finite sample 
issues in the paragraph “Conditional Independence Tests.” 
Figure 7.1: The figure summarizes two approaches for the identification of causal structures.
Independence-based methods (top) test for conditional independences in the data;
these properties are related to the graph structure by the Markov condition and faithfulness.
Often, the graph is not uniquely identifiable; the method may therefore output different
graphs $\mathcal{G}$ and $\mathcal{G}'$. Alternatively, one may restrict the model class and fit the SCM directly
(bottom).


Estimation of Skeleton Most independence-based methods first estimate the 
skeleton, that is, the undirected edges, and orient as many edges as possible after- 
ward. For the skeleton search, the following lemma is useful to know [see Verma 
and Pearl, 1991, Lemma 1]. 


Lemma 7.8 The following two statements hold.


(i) Two nodes $X, Y$ in a DAG $(\mathbf{X}, \mathcal{E})$ are adjacent if and only if they cannot be
d-separated by any subset $S \subseteq \mathbf{X} \setminus \{X, Y\}$.


(ii) If two nodes $X, Y$ in a DAG $(\mathbf{X}, \mathcal{E})$ are not adjacent, then they are d-separated
by either $\mathbf{PA}_X$ or $\mathbf{PA}_Y$.


Using Lemma 7.8(i), we have that if two variables are always dependent, no mat- 
ter what other variables one conditions on, these two variables must be adjacent. 
This result is used in the IC algorithm [Pearl, 2009] and in the SGS algorithm 
[Spirtes et al., 2000]. For each pair of nodes (X,Y), these methods search through 
all possible subsets $\mathbf{A} \subseteq \mathbf{X} \setminus \{X, Y\}$ of variables containing neither $X$ nor $Y$ and
check whether X and Y are d-separated given A. After all those tests, X and Y are 
adjacent if and only if no set A was found that d-separates X and Y. 
Searching through all possible subsets A does not seem optimal, especially if
the graph is sparse. The PC algorithm [Spirtes et al., 2000] starts with a fully 
connected undirected graph and step-by-step increases the size of the conditioning 
set A, starting with #A = 0. At iteration k, it considers sets A of size #A = k, 
using the following neat trick: to test whether X and Y can be d-separated, one 
only has to go through sets A that are subsets either of the neighbors of X or of 
the neighbors of Y; this idea is based on Lemma 7.8(ii) and clearly improves the 
computation time, especially for sparse graphs. 
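
The skeleton phase can be sketched in a few lines once an independence oracle is available. The code below is a simplified illustration and not the pcalg implementation: `d_separated` plays the role of the oracle (using the standard criterion via the moralized ancestral graph), and `pc_skeleton` removes edges while increasing the size k of the conditioning set, drawing candidate sets only from the current neighbors of one endpoint. All function names and the example DAG are our own.

```python
from itertools import combinations


def d_separated(parents, x, y, cond):
    """Oracle: is x d-separated from y given cond in the DAG `parents`?

    `parents` maps each node to the set of its parents.
    """
    # 1. Ancestral set of {x, y} and cond.
    relevant, stack = set(), [x, y, *cond]
    while stack:
        v = stack.pop()
        if v not in relevant:
            relevant.add(v)
            stack.extend(parents[v])
    # 2. Moralize: link each node to its parents and marry co-parents.
    nbrs = {v: set() for v in relevant}
    for v in relevant:
        pa = list(parents[v])
        for p in pa:
            nbrs[v].add(p)
            nbrs[p].add(v)
        for p, q in combinations(pa, 2):
            nbrs[p].add(q)
            nbrs[q].add(p)
    # 3. d-separated iff cond blocks every path from x to y in the moral graph.
    seen, stack = {x}, [x]
    while stack:
        v = stack.pop()
        for w in nbrs[v] - set(cond) - seen:
            if w == y:
                return False
            seen.add(w)
            stack.append(w)
    return True


def pc_skeleton(nodes, is_independent):
    """PC-style skeleton search; is_independent(x, y, A) is the oracle."""
    adj = {v: set(nodes) - {v} for v in nodes}
    sepset = {}
    k = 0
    # Continue while some node has enough neighbors to form a set of size k.
    while any(len(adj[x]) - 1 >= k for x in nodes):
        for x in nodes:
            for y in list(adj[x]):
                # Candidate sets: size-k subsets of the other neighbors of x.
                for A in combinations(sorted(adj[x] - {y}), k):
                    if is_independent(x, y, set(A)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(A)
                        break
        k += 1
    return adj, sepset


# Example DAG (our choice): X -> Z <- Y and Z -> W.
dag = {"X": set(), "Y": set(), "Z": {"X", "Y"}, "W": {"Z"}}
adj, sepset = pc_skeleton(list(dag), lambda x, y, A: d_separated(dag, x, y, A))
print(adj)
```

Since every ordered pair (x, y) is visited, subsets of the neighbors of both endpoints are eventually tried, which is the restriction that Lemma 7.8(ii) justifies. The separating sets stored in `sepset` are what the orientation step needs afterward.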


Orientation of Edges Lemma 6.25 suggests that we should be able to orient the 
immoralities (or v-structures) in the graph. If two nodes are not directly connected 
in the obtained skeleton, there is a set that d-separates these nodes. Suppose that 
the skeleton contains the structure $X - Z - Y$ with no direct edge between $X$ and
$Y$; further, let $\mathbf{A}$ be a set that d-separates $X$ and $Y$. The structure $X - Z - Y$ is
an immorality and can therefore be oriented as $X \to Z \leftarrow Y$ if and only if $Z \notin \mathbf{A}$.
After the orientation of immoralities, we may be able to orient some further edges 
in order to avoid cycles, for example. There is a set of such orientation rules that 
has been shown to be complete and is known as Meek’s orientation rules [Meek, 
1995]. 
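
The orientation of immoralities described above can be sketched as follows. The code assumes a skeleton given as an adjacency dictionary and, for every non-adjacent pair, one stored separating set (as a skeleton search would record it); the function name and the example are ours. Meek's additional orientation rules are not implemented here.

```python
from itertools import combinations


def orient_immoralities(adj, sepset):
    """Return the set of directed edges (a, b), meaning a -> b, obtained by
    orienting every unshielded triple x - z - y with z not in sepset(x, y).

    `adj` maps each node to its neighbors in the skeleton; `sepset` maps
    frozenset((x, y)) to a separating set for each non-adjacent pair.
    """
    directed = set()
    for z in adj:
        for x, y in combinations(sorted(adj[z]), 2):
            if y not in adj[x] and z not in sepset[frozenset((x, y))]:
                directed.add((x, z))
                directed.add((y, z))
    return directed


# Skeleton and separating sets of the DAG X -> Z <- Y, Z -> W (our example).
skeleton = {"X": {"Z"}, "Y": {"Z"}, "Z": {"X", "Y", "W"}, "W": {"Z"}}
sepsets = {
    frozenset(("X", "Y")): set(),
    frozenset(("X", "W")): {"Z"},
    frozenset(("Y", "W")): {"Z"},
}
print(sorted(orient_immoralities(skeleton, sepsets)))
```

In this example the edge Z - W stays undirected after this step; orienting it as Z -> W (to avoid creating a new immorality at Z) is the kind of conclusion that the further orientation rules provide.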


Satisfiability Methods An alternative to the graphical approach just described is 
to formulate causal learning as a satisfiability (SAT) problem [Triantafillou et al., 
2010]. First, one formulates graphical relations as Boolean variables, such as A := 
“There is a direct edge from X to Y.” The non-trivial part is then to translate the in- 
dependence statements (we still assume that they are provided by an independence 
oracle), as d-separation statements into “formulas” that involve Boolean variables 
and the operators “and” and “or.” The SAT question then asks whether we can as- 
sign a value “true” or “false” to each of the Boolean variables to make the overall 
formula true. SAT solvers not only check whether this is the case but also pro- 
vide us with the information as to whether in all of the assignments that make the 
overall formula true, certain variables are always assigned to the same value. For 
example, the d-separation statements may be satisfied by different graph structures 
that correspond to different assignments, but if in all such assignments the Boolean 
variable A from above takes the value “true,” we can infer that in the underlying 
graph, X must be a parent of Y. Even though the Boolean SAT problem is known 
to be nondeterministic polynomial time (NP)-complete [Cook, 1971, Levin, 1973], 
that is, it is in NP and NP-hard, there are heuristic algorithms that can solve large
problem instances, involving millions of variables. SAT methods in causal learning
allow us to query specific statements such as an ancestral relation rather than estimating
the full graph. They let us incorporate different kinds of prior knowledge and
furthermore, we can put weights on the independence constraints if we believe 
that some of the (statistical) findings contradict each other. These approaches have 
been extended to cycles, latent variables, and overlapping data sets [Hyttinen et al., 
2013, Triantafillou and Tsamardinos, 2015]. 
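
The idea of reading off statements that hold in all satisfying assignments can be illustrated with a brute-force stand-in for a SAT solver. The sketch below enumerates all assignments, which is feasible only for a handful of variables; the toy formula is hypothetical and does not encode actual d-separation statements, and real solvers avoid exhaustive enumeration.

```python
from itertools import product


def forced_values(variables, formula):
    """Enumerate all assignments and return {variable: value} for those
    variables that take the same value in every satisfying assignment.
    Returns an empty dict if the formula is unsatisfiable.
    """
    models = []
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if formula(assignment):
            models.append(assignment)
    if not models:
        return {}
    first = models[0]
    return {v: first[v] for v in variables
            if all(m[v] == first[v] for m in models)}


# Toy Boolean variables: "xy" means "there is a direct edge from X to Y", etc.
variables = ["xy", "yx", "zy"]


def toy_formula(a):
    adjacent = a["xy"] or a["yx"]           # X and Y are adjacent
    no_two_cycle = not (a["xy"] and a["yx"])
    prior_knowledge = not a["yx"]           # assumed background knowledge
    return adjacent and no_two_cycle and prior_knowledge


print(forced_values(variables, toy_formula))
```

Here "xy" is true in every satisfying assignment, so one would infer the edge from X to Y, while "zy" remains undetermined.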


Conditional Independence Tests In the three preceding paragraphs we have as- 
sumed the existence of an independence oracle that tells us whether a specific (con- 
ditional) independence is or is not present in the distribution. In practice, however, 
we have to infer this statement from a finite amount of data. This comes with two 
major challenges: (1) All causal discovery methods that are based on conditional 
independence tests draw conclusions both from dependences and independences. 
In practice, however, one most often uses statistical significance tests, which are 
inherently asymmetric. One therefore usually forgets about the original meaning of 
the significance level and treats it as a tuning parameter. Furthermore, due to finite 
samples, the testing results might even contradict each other in the sense that there 
is no graph structure that encodes the exact set of inferred conditional indepen- 
dences. (2) Although there is some recent work on kernel-based tests [Fukumizu 
et al., 2008, Tillman et al., 2009, Zhang et al., 2011], nonparametric conditional 
independence tests are difficult to perform with a finite amount of data. One there- 
fore often restricts oneself to a subclass of possible dependences, some of which 
we now briefly review. 

If the variables are assumed to follow a Gaussian distribution, we can test for 
vanishing partial correlation (see Appendices A.1 and A.2). Under faithful- 
ness, the Markov equivalence class of the underlying DAG becomes identifiable 
(Lemma 7.2) and indeed, in the Gaussian setting, the PC algorithm with a test for 
vanishing partial correlation provides a consistent estimator for the correct CPDAG 
[Kalisch and Bühlmann, 2007]. Additionally assuming a condition called strong
faithfulness [Zhang and Spirtes, 2003, Uhler et al., 2013] even yields uniform consistency
[Kalisch and Bühlmann, 2007]; see also the discussion in Robins et al.
[2003]. 

Non-parametric conditional independence testing is a difficult problem in the- 
ory and practice. For non-Gaussian distributions, vanishing partial correlation is 
neither necessary nor sufficient for conditional independence, as shown by the fol- 
lowing example. 
Example 7.9 (Conditional independence and partial correlation)
(i) If the distribution $P_{X,Y,Z}$ is entailed by the SCM

$$Z := N_Z, \qquad X := Z^2 + N_X, \qquad Y := Z^2 + N_Y,$$

where $N_X, N_Y, N_Z \overset{\text{iid}}{\sim} \mathcal{N}(0, 1)$, it satisfies

$$X \perp\!\!\!\perp Y \mid Z \quad \text{and} \quad \rho_{X,Y \mid Z} \neq 0.$$

The partial correlation coefficient $\rho_{X,Y \mid Z}$ equals the correlation of $X - \alpha Z$
and $Y - \beta Z$ where $\alpha$ and $\beta$ are the regression coefficients when regressing
$X$ and $Y$ on $Z$, respectively. In this example, $\alpha = \beta = 0$ because $X$ and $Y$ do
not correlate with $Z$.


(ii) The distribution $P_{X,Y,Z}$ entailed by the SCM

$$Z := N_Z, \qquad X := Z + N_X, \qquad Y := Z + N_Y,$$

where $(N_X, N_Y) \perp\!\!\!\perp N_Z$ and $N_X$, $N_Y$ are uncorrelated but not independent,
satisfies

$$X \not\perp\!\!\!\perp Y \mid Z \quad \text{and} \quad \rho_{X,Y \mid Z} = 0$$

since here, $\rho_{X,Y \mid Z}$ is the correlation between $N_X$ and $N_Y$.


Therefore, vanishing partial correlation does not imply and is not implied by con- 
ditional independence. 
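
Both parts of the example can be checked numerically. The following simulation sketch uses only the Python standard library; the sample size, the seed, and the particular choice $N_Y := N_X^2 - 1$ for uncorrelated but dependent noise in part (ii) are our assumptions and are not taken from the text. In part (i), the population partial correlation is $\operatorname{var}(Z^2)/(\operatorname{var}(Z^2) + 1) = 2/3$, so the estimate should be clearly away from zero; in part (ii), it should be close to zero.

```python
import random


def residuals(v, z):
    """Residuals of the least-squares regression of v on z (with intercept)."""
    n = len(v)
    mv, mz = sum(v) / n, sum(z) / n
    slope = (sum((a - mv) * (b - mz) for a, b in zip(v, z))
             / sum((b - mz) ** 2 for b in z))
    return [(a - mv) - slope * (b - mz) for a, b in zip(v, z)]


def correlation(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (w - mb) for u, w in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((w - mb) ** 2 for w in b)
    return cov / (var_a * var_b) ** 0.5


def partial_corr(x, y, z):
    """Sample partial correlation of x and y given a single variable z."""
    return correlation(residuals(x, z), residuals(y, z))


rng = random.Random(0)
n = 20000
z = [rng.gauss(0, 1) for _ in range(n)]

# (i) X := Z^2 + N_X, Y := Z^2 + N_Y: independent given Z, yet correlated residuals.
x1 = [zi ** 2 + rng.gauss(0, 1) for zi in z]
y1 = [zi ** 2 + rng.gauss(0, 1) for zi in z]
rho_i = partial_corr(x1, y1, z)

# (ii) X := Z + N_X, Y := Z + N_Y with N_Y := N_X^2 - 1 (uncorrelated, dependent).
nx = [rng.gauss(0, 1) for _ in range(n)]
x2 = [zi + e for zi, e in zip(z, nx)]
y2 = [zi + e ** 2 - 1 for zi, e in zip(z, nx)]
rho_ii = partial_corr(x2, y2, z)

print(round(rho_i, 2), round(rho_ii, 2))
```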


The following procedure for testing whether X and Y are conditionally indepen- 
dent given Z provides a natural nonlinear extension of partial correlation [e.g., 
Ramsey, 2014]: (1) (nonlinearly) regress X on Z and test whether the residuals are 
independent of Y; (2) (nonlinearly) regress Y on Z and test whether the residuals 
are independent of $X$; (3) if one of those two independences holds, conclude that
$X \perp\!\!\!\perp Y \mid Z$. This seems to be the correct test in the case of ANMs; see Section 7.1.2.
For three variables, for example, we have the following result. 


Proposition 7.10 Consider a distribution $P_{X,Y,Z}$ induced by an ANM (Definition 7.3)
with all variables having strictly positive densities. If $X$ and $Y$ are d-separated
given $Z$, then the procedure just described outputs the corresponding
conditional independence in the sense that either $X - \mathbb{E}[X \mid Z]$ is independent of $Y$
or $Y - \mathbb{E}[Y \mid Z]$ is independent of $X$.
Proof. Assume that $X := h(Z) + N_X$ and $Y := f(Z) + N_Y$, with $Z$, $N_X$, and $N_Y$
being mutually independent. Then, $X - \mathbb{E}[X \mid Z] = N_X$ is independent of $Y$. The
statement follows analogously for the other possible structures, for example,
$X \to Z \to Y$ or $X \leftarrow Z \leftarrow Y$.


The proposition shows that (in a population sense) the test described is appropriate
for ANMs with three variables. Considering four variables $X, Y, Z, W$, however,
may already lead to problems. Clearly, the graphs $X \leftarrow Z \to W \to Y$ and
$X \to Z \to W \to Y$ are Markov equivalent. But while the test outputs $X \perp\!\!\!\perp Y \mid Z$ for
the first graph, there is no such guarantee for the second graph. Thus, the
above-mentioned restriction of the dependence model between random variables that can
be used to construct feasible conditional independence tests leads to asymmetric 
treatment of graphs within a Markov equivalence class. This effect may be the 
same for many other types of methods for conditional independence testing. This 
asymmetry does not necessarily need to be a drawback since, as we have seen, re- 
stricted function classes may lead to identifiability within the Markov equivalence 
class (see Section 7.1). It certainly requires consideration, though. 


7.2.2 Score-Based Methods 


In the preceding section we have directly used the independence statements to in- 
fer the graph. Alternatively, we can test different graph structures in their ability to 
fit the data. The rationale is that graph structures encoding the wrong conditional 
independences, for example, will yield bad model fits. Although the roots for 
score-based methods for causal learning may date back even further, we mainly re- 
fer to Geiger and Heckerman [1994a], Heckerman et al. [1999], Chickering [2002], 
and references therein. The Max-Min Hill-Climbing algorithm [Tsamardinos et al., 
2006] combines score-based and independence-based techniques. 


Best Scoring Graph Given data $\mathcal{D} = (\mathbf{X}^1,\dots,\mathbf{X}^n)$ from a vector $\mathbf{X}$ of variables,
that is, a sample containing $n$ i.i.d. observations, the idea is to assign a score
$S(\mathcal{D}, \mathcal{G})$ to each graph $\mathcal{G}$ and search over the space of DAGs to find the graph
with the highest score:

$$\hat{\mathcal{G}} := \operatorname*{argmax}_{\mathcal{G} \text{ DAG over } \mathbf{X}} S(\mathcal{D}, \mathcal{G}). \qquad (7.6)$$

There are several possibilities to define such a scoring function $S$. Often a parametric
model is assumed (e.g., linear Gaussian equations or multinomial distributions),
which introduces a set of parameters $\theta \in \Theta$.




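(Penalized) Likelihood For each graph we may consider the maximum likelihood
estimator $\hat{\theta}$ for $\theta$ and then define a score function by the BIC

$$S(\mathcal{D}, \mathcal{G}) = \log p(\mathcal{D} \mid \hat{\theta}, \mathcal{G}) - \frac{\#\text{parameters}}{2} \log n, \qquad (7.7)$$

where $\log p(\mathcal{D} \mid \hat{\theta}, \mathcal{G})$ is the log likelihood and $n$ is the sample size. Estimators that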
output the graph with the largest (penalized) likelihood are often consistent. This 
follows from the consistency of BIC [Haughton, 1988], and identifiability of the 
model class. To guarantee rates of convergence, however, one usually relies on a 
“degree of identifiability” [e.g., Bühlmann et al., 2014]. In practice, finding the best 
scoring graph among all possible graphs may not be feasible and search techniques 
over the space of graphs are required (e.g., see the paragraph “Greedy Search Tech- 
niques”). Regularization different from BIC is possible, too. Roos et al. [2008] 
base their score on the minimum description length principle [Grünwald, 2007],
for example. Using work by Haughton [1988], Chickering [2002] discusses how 
the BIC approach relates to a Bayesian formulation that we discuss next. 
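
For two variables with linear Gaussian assignments, the BIC (7.7) can be computed by hand. The sketch below is an illustration under our own assumptions: data are simulated from $Y := 2X + N$, each node contributes a mean and a variance parameter, and each edge contributes one regression coefficient; the helper names are ours. It compares the empty graph with the two single-edge graphs.

```python
import math
import random


def variance(v):
    """Maximum likelihood (1/n) variance estimate."""
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / len(v)


def residual_variance(child, parent):
    """ML residual variance when regressing child on parent (with intercept)."""
    n = len(child)
    mc, mp = sum(child) / n, sum(parent) / n
    slope = (sum((c - mc) * (p - mp) for c, p in zip(child, parent))
             / sum((p - mp) ** 2 for p in parent))
    return variance([c - slope * p for c, p in zip(child, parent)])


def gaussian_loglik(var, n):
    """Maximized Gaussian log-likelihood of n residuals with ML variance var."""
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)


def bic_score(loglik, n_params, n):
    return loglik - 0.5 * n_params * math.log(n)


rng = random.Random(1)
n = 1000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [2 * xi + rng.gauss(0, 1) for xi in x]

score_empty = bic_score(
    gaussian_loglik(variance(x), n) + gaussian_loglik(variance(y), n), 4, n)
score_x_to_y = bic_score(
    gaussian_loglik(variance(x), n) + gaussian_loglik(residual_variance(y, x), n), 5, n)
score_y_to_x = bic_score(
    gaussian_loglik(variance(y), n) + gaussian_loglik(residual_variance(x, y), n), 5, n)

print(score_x_to_y > score_empty)
```

The two single-edge graphs receive the same score up to floating-point error, because both likelihoods depend on the data only through the determinant of the sample covariance matrix. The score therefore prefers a graph with an edge over the empty graph but cannot choose a direction in this linear Gaussian setting.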


Bayesian Scoring Functions We define priors $p_{\mathrm{pr}}(\mathcal{G})$ and $p_{\mathrm{pr}}(\theta)$ over DAGs
and parameters, respectively, and consider the log posterior as a score function
(note that $p(\mathcal{D})$ is constant over all DAGs):

$$S(\mathcal{D}, \mathcal{G}) := \log p(\mathcal{G} \mid \mathcal{D}) = \log p_{\mathrm{pr}}(\mathcal{G}) + \log p(\mathcal{D} \mid \mathcal{G}),$$

where $p(\mathcal{D} \mid \mathcal{G})$ is the marginal likelihood

$$p(\mathcal{D} \mid \mathcal{G}) = \int_{\Theta} p(\mathcal{D} \mid \mathcal{G}, \theta) \, p_{\mathrm{pr}}(\theta) \, d\theta.$$

Here, the resulting estimator $\hat{\mathcal{G}}$ from Equation (7.6) is the mode of the posterior
distribution, which is usually called a maximum a posteriori (MAP) estimator.
Alternatively, one may output the full posterior distribution over DAGs, and, in
principle, even more detailed information is available. For instance, one can average
over all graphs to get a posterior probability of the existence of a specific edge. 

As an example, consider random variables that take only finitely many values. 
For a given structure $\mathcal{G}$, one may then assume that for each parent configuration the
probability distribution of a random variable $X_j$ follows a multinomial distribution.
If we put a Dirichlet prior on its parameters (together with some further conditions 
on parameter independence and modularity), this leads to the Bayesian Dirichlet 
(BD) score [Geiger and Heckerman, 1994b]. 




In the case of parametric models, we call two graphs $\mathcal{G}_1$ and $\mathcal{G}_2$ distribution
equivalent if for each parameter $\theta_1$ there is a corresponding parameter $\theta_2$, such
that the distribution obtained from $\mathcal{G}_1$ in combination with $\theta_1$ is the same as the
distribution obtained from graph $\mathcal{G}_2$ with $\theta_2$, and vice versa. It can be shown (see
Problem 7.12) that in the linear Gaussian case, for example, two graphs are distribution
equivalent if and only if they are Markov equivalent. It has therefore
been argued that $p(\mathcal{D} \mid \mathcal{G}_1)$ and $p(\mathcal{D} \mid \mathcal{G}_2)$ should be the same for Markov equivalent
graphs $\mathcal{G}_1$ and $\mathcal{G}_2$. The BD score can be adapted to satisfy this property. It is
usually referred to as the Bayesian Dirichlet equivalence (BDe) score [Geiger and 
Heckerman, 1994b]. Buntine [1991] proposes a specific version of this score with 
even fewer hyperparameters. 


Greedy Search Techniques The search space of all DAGs grows super-exponentially
in the number of variables [e.g., Chickering, 2002]; the numbers of
DAGs for 2, 3, 4, and 10 variables are 3, 25, 543, and 4175098976430598143,
respectively (see Table B.1). Therefore, computing a solution to Equation (7.6)
by searching over all graphs is often infeasible. Instead, greedy search algorithms 
can be applied to solve (7.6). At each step there is a candidate graph and a set of 
neighboring graphs. For all these neighbors, one computes the score and considers 
the best-scoring graph as the new candidate. If none of the neighbors obtains a 
better score, the search procedure terminates (not knowing whether one obtained 
only a local optimum). Clearly, one therefore has to define a neighborhood relation. 
Starting from a graph $\mathcal{G}$, we may define as neighbors of $\mathcal{G}$ all graphs that can be
obtained by removing, adding, or reversing one edge, for example.
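
The following sketch illustrates both points of this paragraph. `count_dags` reproduces the numbers quoted above using Robinson's recurrence for labeled DAGs, and `neighbors` generates the neighborhood just described (single-edge additions, removals, and reversals that keep the graph acyclic). Graphs are represented as sets of (parent, child) pairs; this representation and the function names are our own choices.

```python
from math import comb


def count_dags(d):
    """Number of labeled DAGs on d nodes via Robinson's recurrence."""
    a = [1]  # a[0] = 1: the empty graph on zero nodes
    for m in range(1, d + 1):
        a.append(sum((-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
                     for k in range(1, m + 1)))
    return a[d]


def is_acyclic(nodes, edges):
    """Repeatedly remove nodes without incoming edges; fail if none exists."""
    remaining, edges = set(nodes), set(edges)
    while remaining:
        sources = {v for v in remaining if not any(b == v for _, b in edges)}
        if not sources:
            return False
        remaining -= sources
        edges = {(a, b) for a, b in edges if a not in sources}
    return True


def neighbors(nodes, edges):
    """All DAGs reachable by adding, removing, or reversing one edge."""
    edges, result = frozenset(edges), set()
    for a in nodes:
        for b in nodes:
            if a == b:
                continue
            if (a, b) in edges:
                result.add(edges - {(a, b)})              # removal
                cand = (edges - {(a, b)}) | {(b, a)}      # reversal
            elif (b, a) not in edges:
                cand = edges | {(a, b)}                   # addition
            else:
                continue
            if is_acyclic(nodes, cand):
                result.add(cand)
    return result


print([count_dags(d) for d in (2, 3, 4)])
print(len(neighbors(["X", "Y", "Z"], set())))
```

A greedy search would score each element of `neighbors(...)`, move to the best one if it improves on the current graph, and stop otherwise.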

In the case of a linear Gaussian SCM, one cannot distinguish between Markov 
equivalent graphs. It turns out that then it is beneficial to change the search space 
to Markov equivalence classes instead of DAGs. The greedy equivalence search 
(GES) [Chickering, 2002] optimizes the BIC criterion (7.7) and starts with the 
empty graph. It consists of two phases: in the first phase, edges are added until a
local maximum is reached; in the second phase, edges are removed until a local 
maximum is reached, which is then given as an output of the algorithm. 


Exact Methods In general, finding the optimal scoring DAG is NP-hard [Chick- 
ering, 1996] but still there is a lot of interesting research that tries to scale up exact 
methods. Here, “exact” means that they aim at finding (one of) the best scoring 
graphs for a given finite data set. Greedy search techniques are often heuristic and 
have guarantees — if at all — only in the limit of infinite data. 
One line of research is based on dynamic programming [Silander and Myllymäki,
2006, Koivisto and Sood, 2004, Koivisto, 2006]. These approaches exploit
the decomposability of many scores that are used in practice: due to the Markov
factorization, we have for $\mathcal{D} = (\mathbf{X}^1,\dots,\mathbf{X}^n)$ that

$$\log p(\mathcal{D} \mid \hat{\theta}, \mathcal{G}) = \sum_{j=1}^{d} \sum_{i=1}^{n} \log p(X_j^i \mid \mathbf{X}_{\mathbf{PA}_j}^i, \hat{\theta}),$$


which is a sum of d “local” scores. Methods based on dynamic programming 
exploit this decomposability, and despite their exponential complexity they can 
find the best scoring graph for > 30 variables, even if one does not restrict the 
number of parents. This is a remarkable result given the enormous number of 
different DAGs over this number of variables (see Table B.1). 
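
To show how decomposability enables exact search, here is a deliberately naive sketch of the dynamic programming idea: the best DAG on a node subset W consists of a sink of W, with its best parent set chosen from the remaining nodes, and the best DAG on the remaining nodes. This is not the algorithm of the cited papers, which precompute and reuse best parent sets far more efficiently; the function names and the toy local score are hypothetical.

```python
from itertools import combinations


def best_dag(d, local_score):
    """Exact search over DAGs on nodes 0..d-1 for a decomposable score.

    local_score(j, parents) returns the local score of node j for the given
    frozenset of parents. Returns (best_total_score, parent_sets).
    """
    nodes = range(d)

    def best_parents(j, candidates):
        # Best-scoring parent set of j among all subsets of `candidates`.
        options = (frozenset(c) for r in range(len(candidates) + 1)
                   for c in combinations(sorted(candidates), r))
        return max(((local_score(j, p), p) for p in options), key=lambda t: t[0])

    # best[W] = (score, parent sets) of the best DAG on the node subset W.
    best = {frozenset(): (0.0, {})}
    for size in range(1, d + 1):
        for W in map(frozenset, combinations(nodes, size)):
            options = []
            for sink in W:  # some node of W is a sink of the optimal DAG on W
                rest = W - {sink}
                s_sink, pa = best_parents(sink, rest)
                s_rest, pa_rest = best[rest]
                options.append((s_rest + s_sink, {**pa_rest, sink: pa}))
            best[W] = max(options, key=lambda t: t[0])
    return best[frozenset(nodes)]


# Toy local score (illustration only): penalize deviations from 0 -> 1 -> 2.
true_parents = {0: frozenset(), 1: frozenset({0}), 2: frozenset({1})}
score, parent_sets = best_dag(3, lambda j, p: -len(p ^ true_parents[j]))
print(score, parent_sets)
```

The table `best` has one entry per node subset, so time and memory grow exponentially in d, but far more slowly than the number of DAGs.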

The integer linear programming (ILP) framework assumes not only decompos- 
ability but also that the scoring function gives the same score to Markov equivalent 
graphs. The idea is then to represent graphical structures as vectors, such that the 
scoring function becomes an affine function in this vector representation. Studený 
and Haws [2014] describe how Hemmecke et al. [2012] base their representation 
on characteristic imsets, while Jaakkola et al. [2010] and Cussens [2011] use (ex- 
ponentially long) zero-one codes instead that indicate parent-child-relationships 
between nodes and reduce the search space exploiting work by De Campos and Ji 
[2011]. Formulated as an ILP problem, the task is still NP-hard, but one may now
use off-the-shelf methods for ILP. Restricting the number of parents leads to further
advances; in “pedigree learning,” for example, each node has at most two parents
[Sheehan et al., 2014].


7.2.3 Additive Noise Models 


ANMs can be learned with score-based methods that are combined with a greedy
search technique. This has been proposed for linear Gaussian models with equal
error variances (Section 7.1.3) or nonlinear Gaussian ANMs (Section 7.1.5) [see
Peters and Bühlmann, 2014, Bühlmann et al., 2014]. In the nonlinear Gaussian
case, for example, we can proceed analogously to the bivariate case (see Equations
(4.18) and (4.19)). For a given graph structure G, we regress each variable on
its parents and obtain the score

$$\log p(D \mid G) = \sum_{j=1}^{d} -\log \mathrm{var}[R_j];$$

here, $\mathrm{var}[R_j]$ is the empirical variance of the residuals $R_j$ obtained from the
regression of variable $X_j$ on its parents. Intuitively, the better the model fits the data,
the smaller the variance of the residuals and thus the larger our score. Formally, the
procedure is an instance of maximum likelihood and can be shown to be consistent
[Bühlmann et al., 2014]. Computationally, we can again exploit the property
that the score decomposes over the different nodes. When computing the score
for a neighboring graph that changes the parent set of only one variable, we need
to update only the corresponding summand. If the noise cannot be assumed to
have a Gaussian distribution, for example, one can estimate the noise distribution
[Nowzohour and Bühlmann, 2016] and obtain an entropy-like score.
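
The residual-variance score can be sketched in a few lines. Note the assumptions:
the example below uses ordinary linear least squares in place of a nonlinear
regression method, simulated data, and helper names (`residual_variance`,
`local_score`, `graph_score`) that are made up for this illustration.

```python
import math
import random


def residual_variance(y, columns):
    """Empirical residual variance after least-squares regression of y on the
    given columns plus an intercept (solved via the normal equations)."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    gram = [[sum(row[r] * row[c] for row in design) for c in range(k)] for r in range(k)]
    rhs = [sum(row[r] * yi for row, yi in zip(design, y)) for r in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(gram[r][col]))
        gram[col], gram[pivot] = gram[pivot], gram[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, k):
            factor = gram[r][col] / gram[col][col]
            for c in range(col, k):
                gram[r][c] -= factor * gram[col][c]
            rhs[r] -= factor * rhs[col]
    beta = [0.0] * k
    for r in reversed(range(k)):
        tail = sum(gram[r][c] * beta[c] for c in range(r + 1, k))
        beta[r] = (rhs[r] - tail) / gram[r][r]
    residuals = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(design, y)]
    return sum(res * res for res in residuals) / n


def local_score(data, j, parents):
    """One summand of the score: -log var[R_j]."""
    return -math.log(residual_variance(data[j], [data[p] for p in parents]))


def graph_score(data, parent_sets):
    """Sum of local scores; parent_sets maps each node to a list of its parents."""
    return sum(local_score(data, j, pa) for j, pa in parent_sets.items())


random.seed(0)
x = [random.gauss(0, 1) for _ in range(500)]
y = [2 * xi + 0.5 * random.gauss(0, 1) for xi in x]
data = {0: x, 1: y}
score_edge = graph_score(data, {0: [], 1: [0]})   # graph 0 -> 1
score_empty = graph_score(data, {0: [], 1: []})   # empty graph
```

Because the score is a sum over nodes, moving from the empty graph to the graph
0 -> 1 changes only the summand of node 1. With a linear Gaussian fit as used here,
the two directions 0 -> 1 and 1 -> 0 receive the same score, so this sketch cannot
orient the edge.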

Alternatively, one can estimate the structure in an iterative way using indepen- 
dence tests. Mooij et al. [2009] and Peters et al. [2014] propose a regression with 
subsequent independence test (RESIT). The method is based on the property that 
the noise variables are independent of all preceding variables. For linear non- 
Gaussian models (Section 7.1.4), Shimizu et al. [2006] provide a practical method 
based on ICA [Comon, 1994, Hyvärinen et al., 2001] that can be applied to a finite 
amount of data. An improved version of this method was later proposed by
Shimizu et al. [2011].


7.2.4 Known Causal Ordering 


It is often difficult to find the causal ordering (see Appendix B) of the underlying 
causal model. Given the causal ordering, however, estimating the graph reduces to 
“classical” variable selection. Assume, for example, that 


$$\begin{aligned}
X &:= N_X \\
Y &:= f(X, N_Y) \\
Z &:= g(X, Y, N_Z)
\end{aligned}$$


with unknown $f, g, N_X, N_Y, N_Z$. Deciding whether f depends on X, and g depends
on X and/or Y (see the assumption of structural minimality in Remark 6.6) is then
a well-studied significance problem in “traditional” statistics. Standard methods 
can be used, especially if further structural assumptions are made, such as linearity 
[e.g., Hastie et al., 2009, Bühlmann and van de Geer, 2011]. This observation
has been made before [e.g., Teyssier and Koller, 2005, Shojaie and Michailidis,
2010] and it has been suggested that instead of searching over the space of directed
acyclic graphs, it might be beneficial to search over the causal order first and then
perform variable selection [e.g., Teyssier and Koller, 2005, Bühlmann et al., 2014].


7.2.5 Observational and Experimental Data


Section 7.1.6 describes how causal structures may become identifiable when we 
observe the system under different conditions (“environments”). We now discuss 
how these results can be exploited in practice, that is, given only finitely many data. 
Let us therefore assume that we obtain one sample $\mathbf{X}^e$ for each environment $e \in \mathcal{E}$;
that is, for each of the environments, we observe $n^e$ i.i.d. data points.


Known Intervention Targets Here, each setting corresponds to an interventional
experiment, and we have additional knowledge of the intervention targets
$T^e \subseteq \{1,\dots,p\}$. Cooper and Yoo [1999] incorporate the intervention effects as
mechanism changes into a Bayesian framework. For perfect interventions, Hauser
and Bühlmann [2015] consider linear Gaussian SCMs and propose a greedy
interventional equivalence search (GIES), a modified version of the GES algorithm that
we briefly described in Section 7.2.2.

Sometimes, one is not able to measure all variables in each experiment (this can 
even be the case when all experiments are observational) but nevertheless wants to 
combine the information from the available data; this problem has been addressed 
by SAT-based approaches [see, e.g., Triantafillou and Tsamardinos, 2015, Tillman
and Eberhardt, 2014, and references therein].


Unknown Intervention Targets Eaton and Murphy [2007] do not assume that
the targets of the different interventions are known. Instead, they introduce for
each environment $e \in \mathcal{E}$ an intervention node $I_e$ with no incoming edges (see
“Intervention Variables” on page 95); for each data point only one intervention node
is active. Then, they apply standard methods to the enlarged model with $d + \#\mathcal{E}$
variables, subject to the constraint that intervention nodes do not have any parents.

Tian and Pearl [2001] propose to test whether the marginal distributions change 
in the different settings and use this information to infer parts of the graph structure. 
They even combine this method with an independence-based method. 


Different Environments In Section 7.1.6, we have also considered the problem
of estimating the causal parents of a target variable Y among the set $\mathbf{X}$ of d predictors.
To this end, we have defined $\mathcal{S}$ as the collection of all sets $S \subseteq \{1,\dots,d\}$
that satisfy invariant prediction, that is, for which $P_{Y^e \mid \mathbf{X}^e_S}$ remains invariant over all
environments $e \in \mathcal{E}$; see (7.4). In practice, we can test the hypothesis of invariant
prediction at level $\alpha$ and collect all sets S that pass the test as an estimate $\hat{\mathcal{S}}$ for
the set $\mathcal{S}$. Because the true set of parents $PA_Y \subseteq \mathbf{X}$ is a member of $\hat{\mathcal{S}}$ with high
probability $(1 - \alpha)$, we obtain the coverage statement

$$\bigcap_{S \in \hat{\mathcal{S}}} S \subseteq PA_Y \tag{7.8}$$

with high probability $(1 - \alpha)$. The left-hand side of (7.8) is the output of a method
called “invariant causal prediction” [Peters et al., 2016]. Code Snippet 7.11 shows
an example for which the environments correspond to different interventions (this
is not required by the method). To obtain correct coverage in the sense of (7.8),
one only needs to model the conditional Y given $PA_Y$; in particular, one does not
assume anything on the distribution of the d predictors $\mathbf{X}$. This is different for
the method proposed by Eaton and Murphy [2007] (see the paragraph “Unknown
Intervention Targets”), which additionally tries to estimate the full causal structure.
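
The final step of this procedure, intersecting all sets that pass the invariance
test, can be sketched as follows. The invariance test itself is not implemented
here; the p-values are hypothetical numbers chosen for the illustration, and
returning `None` when every set is rejected is a convention assumed for this sketch.

```python
def icp_estimate(p_values, alpha=0.05):
    """Intersect all predictor sets whose invariance hypothesis is not rejected.

    p_values maps frozensets of predictor indices to the p-value of a
    (hypothetical) test of invariant prediction for that set.
    """
    accepted = [s for s, p in p_values.items() if p > alpha]
    if not accepted:
        return None  # assumption of this sketch: no set is accepted
    return frozenset.intersection(*accepted)


# Hypothetical p-values for all subsets of three predictors.
p_values = {
    frozenset(): 0.0, frozenset({1}): 0.001, frozenset({2}): 0.0,
    frozenset({3}): 0.0, frozenset({1, 2}): 0.62, frozenset({1, 3}): 0.0,
    frozenset({2, 3}): 0.0, frozenset({1, 2, 3}): 0.003,
}
estimate = icp_estimate(p_values)
```

If several sets are accepted, the intersection can only shrink, which is why the
output tends to be conservative: it may miss some parents, but under the coverage
statement (7.8) it rarely contains non-parents.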


Code Snippet 7.11 The following code shows an example of a causal system in
two environments. In the true underlying structure, $X_1$ and $X_2$ are
causing Y, which itself is causing $X_3$. In a linear model on the pooled data (the
call to lm), all variables $X_1$, $X_2$, and $X_3$ are highly significant since all of them are
good predictors for Y. Such a model is not invariant, however. In the two environments, a
regression of Y on $X_1$, $X_2$, $X_3$ yields coefficients −0.15, 1.09, −0.39, and −0.32,
1.62, −0.54, respectively. The method of invariant causal prediction outputs only
the causal parents of Y, that is, $X_1$ and $X_2$. In this example, {1,2} is the only set
yielding an invariant model, that is, $\mathcal{S} = \{\{1,2\}\}$.


library(InvariantCausalPrediction)
#
# generate data from two environments
env <- c(rep(1, 400), rep(2, 700))
n <- length(env)
set.seed(1)
X1 <- rnorm(n)
X2 <- 1*X1 + c(rep(0.1, 400), rep(1.0, 700))*rnorm(n)
Y <- -0.7*X1 + 0.6*X2 + 0.1*rnorm(n)
X3 <- c(rep(-2, 400), rep(-1, 700))*Y + 2.5*X2 + 0.1*rnorm(n)
#
# linear model on the pooled data
summary(lm(Y ~ -1 + X1 + X2 + X3))
# Coefficients:
#     Estimate   Std.Error  t.val.  Pr(>|t|)
# X1  -0.396212  0.008667   -45.71  <2e-16 ***
# X2  +1.381497  0.021377   +64.63  <2e-16 ***
# X3  -0.410647  0.011152   -36.82  <2e-16 ***
#
# invariant causal prediction with env as the environment indicator
ICP(cbind(X1, X2, X3), Y, env)
#     lower bd  upper bd  p-value
# X1  -0.71     -0.68     3.7e-06 ***
# X2  +0.59     +0.61     0.0092  **
# X3  -0.00     +0.00     0.2972


7.3 Problems


Problem 7.12 (Gaussian SCMs) Prove that for linear Gaussian SCMs, two
graphs $G_1$ and $G_2$ are distribution equivalent if and only if they are Markov
equivalent. Here, we allow for zero coefficients.


Problem 7.13 (Gaussian SCMs) Consider a distribution $P_X$ of $X = (X_1,\dots,X_d)$
with density p induced from a linear Gaussian SCM $\mathfrak{C}$. Prove that for any DAG
G such that $P_X$ is Markovian with respect to G, there is a corresponding linear
Gaussian SCM $\mathfrak{C}_G$ entailing $P_X$.


Problem 7.14 (ANMs) Prove that ANMs over $X = (X_1,\dots,X_d)$ with differentiable
functions $f_j$ and noise variables that have a strictly positive density entail a
distribution over X that has a strictly positive density, too (see Definition 7.3).


Problem 7.15 (Invariant causal prediction) Prove Equation (7.5). 


8 


Connections to Machine Learning, II


As argued in Chapter 5, the causal structure that underlies a statistical model can 
have strong implications for machine learning tasks such as semi-supervised learn- 
ing or domain adaptation. We now revisit this general topic, focusing on the multi- 
variate case. We begin with a method that uses machine learning to model system- 
atic errors for a given causal structure, followed by some thoughts on reinforce- 
ment learning (with an application in computational advertising), and finally we 
comment on the topic of domain adaptation. 


8.1 Half-Sibling Regression 


This method exploits a given causal structure (see Figure 8.1) to reduce system- 
atic noise in a prediction task. The goal is to reconstruct the unobserved signal Q. 
Schölkopf et al. [2015] suggest that we can denoise the signal Y by removing all
information that can be explained by other measurements X that have been corrupted
with the same source of noise. Here, X are measurements of some signals R that
are independent of Q. Intuitively, everything in Y that can be explained by X must
be due to the systematic noise N and should therefore be removed. More precisely,
we consider

$$\hat{Q} := Y - \mathbb{E}[Y \mid X]$$


as an estimate for Q. Here, E[Y |X] is the regression of Y on its half-siblings X 
(note that X and Y share the parent N; see Figure 8.1). 

Figure 8.1: The causal structure that applies to the exoplanet search problem. The
underlying signal of interest Q can only be measured as a noisy version Y. If the same noise
source also corrupts measurements of other signals that are independent of Q, those
measurements can be used for denoising. In our example, the telescope N constitutes systematic
noise that affects measurements X and Y of independent light curves.

One can show that for any random variables Q, X, Y that satisfy $Q \perp\!\!\!\perp X$, we have
[Schölkopf et al., 2016, Proposition 1]

$$\mathbb{E}\big[(Q - \mathbb{E}[Q] - \hat{Q})^2\big] \le \mathbb{E}\big[(Q - \mathbb{E}[Q] - (Y - \mathbb{E}[Y]))^2\big],$$

that is, the method is never worse than taking the measurement Y. If, moreover,
the systematic noise acts in an additive manner, that is, $Y = Q + f(N)$ for some
(unknown) function f, we have [Schölkopf et al., 2016, Proposition 3]

$$\mathbb{E}\big[(Q - \mathbb{E}[Q] - \hat{Q})^2\big] = \mathbb{E}\big[\mathrm{var}[f(N) \mid X]\big]. \tag{8.1}$$

If the additive noise is a function of X, that is, $f(N) = \psi(X)$ for some (unknown)
function $\psi$, then the right-hand side of (8.1) vanishes and hence $\hat{Q}$ recovers Q up
to an additive shift; see Schölkopf et al. [2016] for other sufficient conditions.
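
A small simulation may help to make the idea concrete. It is only a sketch under
strong assumptions that are not taken from the text: the noise enters linearly,
there is a single half-sibling X, and the regression of Y on X is approximated by a
simple linear fit.

```python
import random

random.seed(0)
n = 2000
q = [random.gauss(0, 1) for _ in range(n)]            # signal of interest Q
noise = [random.gauss(0, 1) for _ in range(n)]        # systematic noise N
y = [qi + 2.0 * ni for qi, ni in zip(q, noise)]       # Y = Q + f(N), with linear f
x = [ni + 0.1 * random.gauss(0, 1) for ni in noise]   # half-sibling X, independent of Q


def mean(values):
    return sum(values) / len(values)


# Approximate E[Y | X] by a simple linear regression of Y on X.
mx, my = mean(x), mean(y)
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
q_hat = [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

# Compare the two sides of the inequality above on the sample.
mq = mean(q)
mse_raw = mean([(qi - mq - (yi - my)) ** 2 for qi, yi in zip(q, y)])
mse_hsr = mean([(qi - mq - qh) ** 2 for qi, qh in zip(q, q_hat)])
```

In this setup X determines N almost perfectly, so the conditional variance on the
right-hand side of (8.1) is small and `mse_hsr` should be far below `mse_raw`.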

As an example, consider the search for exoplanets. The Kepler space observatory, 
launched in 2009, observed a small fraction of the Milky Way during its search for 
exoplanets, monitoring the brightness of approximately 150,000 stars.¹ Those stars
that are surrounded by a planet with a suitable orbit to allow for partial occlusions 
of the star will exhibit light curves that show a periodic decrease of light intensity; 
see Figure 8.2. These measurements are corrupted with systematic noise that is 
due to the telescope and that makes the signal from possible planets hard to detect. 

Fortunately, the telescope measures many stars at the same time. These stars can 
be assumed to be causally and therefore statistically independent since they are 
light-years apart from each other. Thus, the causal structure depicted in Figure 8.1 
fits very well to this problem and we may apply the half-sibling regression. This 
simple method performs surprisingly well [Schölkopf et al., 2015]. 


¹ https://en.wikipedia.org/wiki/Kepler_(spacecraft), accessed 13.07.2016.

Figure 8.2: Every time a planet occludes a part of the star, the light intensity decreases.
If the planet orbits the star, this phenomenon occurs periodically. (Image courtesy of
Nikola Smolenski, https://en.wikipedia.org/wiki/File:Planetary_transit.svg,
[CC BY-SA 3.0]. Image has been edited for clarity and style.)


Related approaches have been used in other application fields without reference 
to causal modeling [Gagnon-Bartsch and Speed, 2012, Jacob et al., 2016]. Con- 
sidering the causal structure of the problem (Figure 8.1) immediately suggests the 
proposed methodology and leads to theoretical arguments justifying the approach. 


8.2 Causal Inference and Episodic Reinforcement Learning


We now describe a class of problems in reinforcement learning from a causal per- 
spective. Roughly speaking, in reinforcement learning, an agent is embedded in 
a world and chooses among a set of different actions. Depending on the current 
state of the world, these actions yield some reward and change the state of the 
world. The goal of the agent is to maximize the expected cumulated reward (see 
Section 8.2.2 for more details). We first introduce the concept of inverse prob- 
ability weighting that has been applied in different contexts throughout machine 
learning and statistics and then relate it to episodic reinforcement learning. Draw- 
ing this connection is a first small step toward relating causality and reinforcement 
learning. The causal point of view enables us to exploit conditional independences 
that directly follow from the causal structure. We briefly mention two applications 
— blackjack and the placement of advertisement — and show how they benefit 
from causal knowledge. The causal formulation leads to these improvements of 
methodology very naturally but it is certainly possible to formulate these problems 
and corresponding algorithms without causal language. This section does not prove 
that reinforcement learning benefits from causality. Instead, we regard it as a step
toward establishing a formal link between these two fields that may lead to fruitful
research in the future [see also Bareinboim et al., 2015, for example]. More concretely,
we believe that causality could play a role when transferring knowledge between 
different tasks in reinforcement learning (e.g., when progressing to the next level 
in a computer game or when changing the opponent in table tennis); however, we 
are not aware of any such result. 


8.2.1 Inverse Probability Weighting 


Inverse probability weighting is a well-known technique that is used to estimate 
properties of a distribution from a sample that follows a different distribution. It 
therefore naturally relates to causal inference. Consider the kidney stone example 
(Example 6.37). We defined the binary variables size S, treatment T, and recov- 
ery R, and after obtaining observational data, we were interested in the expected 
recovery rate E[R] in a hypothetical study in which everyone received treatment 
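A, that is, under a different distribution. Formally, consider an SCM $\mathfrak{C}$ entailing
the distribution $P^{\mathfrak{C}}_X$ over variables $X = (X_1,\dots,X_d)$. We have argued that one often
observes a sample from the observational distribution $P^{\mathfrak{C}}_X$ but is interested in
some intervention distribution $P^{\tilde{\mathfrak{C}}}_X$. Here, the new SCM $\tilde{\mathfrak{C}}$ is constructed from the
original $\mathfrak{C}$ by intervening on a node $X_k$, say,

$$do\big(X_k := \tilde{f}(X_{\widetilde{PA}_k}, \tilde{N}_k)\big);$$

see Section 6.3. In particular, we might want to estimate a certain property

$$\tilde{\mathbb{E}}\,\ell(X) := \mathbb{E}_{P^{\tilde{\mathfrak{C}}}_X}\,\ell(X)$$

of the new distribution $P^{\tilde{\mathfrak{C}}}_X$ (in the kidney stone example, this is $\tilde{\mathbb{E}}[R]$). If densities
exist, we have seen in Section 6.3 that the densities of $\mathfrak{C}$ and $\tilde{\mathfrak{C}}$ factorize in a similar
way:

$$p(x_1,\dots,x_d) := p^{\mathfrak{C}}(x_1,\dots,x_d) = \prod_{j=1}^{d} p(x_j \mid x_{pa(j)}) \quad \text{and}$$

$$\tilde{p}(x_1,\dots,x_d) := p^{\tilde{\mathfrak{C}}}(x_1,\dots,x_d) = \prod_{j \neq k} p(x_j \mid x_{pa(j)}) \; \tilde{p}(x_k \mid x_{\widetilde{pa}(k)}).$$

The factorizations agree except for the term of the intervened variable. We therefore
have

$$\xi := \tilde{\mathbb{E}}\,\ell(X) = \int \ell(x)\, \tilde{p}(x)\, dx = \int \ell(x)\, \frac{\tilde{p}(x)}{p(x)}\, p(x)\, dx = \int \ell(x)\, \frac{\tilde{p}(x_k \mid x_{\widetilde{pa}(k)})}{p(x_k \mid x_{pa(k)})}\, p(x)\, dx.$$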

(For simplicity, we assume throughout the whole section that the densities are
strictly positive.) Given a sample $X^1,\dots,X^n$ drawn from the distribution $P^{\mathfrak{C}}_X$, we
can thus construct an estimator

$$\hat{\xi}_n := \frac{1}{n} \sum_{i=1}^{n} \ell(X^i)\, \frac{\tilde{p}(X^i_k \mid X^i_{\widetilde{pa}(k)})}{p(X^i_k \mid X^i_{pa(k)})} = \frac{1}{n} \sum_{i=1}^{n} \ell(X^i)\, w_i \tag{8.2}$$

for $\xi = \tilde{\mathbb{E}}\,\ell(X)$ by reweighting the observations; here, the weights $w_i$ are defined as
the ratio of the conditional densities. Data points that have a high likelihood
under $P^{\tilde{\mathfrak{C}}}_X$ (they “could have been drawn” from the new distribution of interest)
receive a large weight and contribute more to the estimate $\hat{\xi}_n$ than those with a
small weight. This kind of estimator appears in the following three situations, for
example.


(i) Suppose that X = (Y, Z) contains only a target variable Y and a causal covariate
Z, that is, Z → Y. Let us consider an intervention in Z and the function
$\ell(X) = \ell((Z,Y)) = Y$. Then, the estimator (8.2) reduces to

$$\hat{\xi}_n = \frac{1}{n} \sum_{i=1}^{n} Y^i\, \frac{\tilde{p}(Z^i)}{p(Z^i)}, \tag{8.3}$$

which is known as the Horvitz-Thompson estimator [Horvitz and Thompson,
1952]. This setting corresponds to the assumption of covariate shift
[e.g., Shimodaira, 2000, Quiñonero-Candela et al., 2009, Ben-David et al.,
2010]; see also Sections 5.2 and 8.3. The estimator (8.3) is an example of a
weighted likelihood estimator.


(ii) For X = Z, we may estimate the expectation $\tilde{\mathbb{E}}[\ell(Z)]$ under $\tilde{p}$ using data
sampled from p. Thus, Equation (8.2) reduces to

$$\hat{\xi}_n = \frac{1}{n} \sum_{i=1}^{n} \ell(Z^i)\, \frac{\tilde{p}(Z^i)}{p(Z^i)},$$

a formula that is known as importance sampling [e.g., MacKay, 2002,
Chapter 29.2]. The formula can be adapted if p and $\tilde{p}$ are known only up
to constants.


(iii) We will make use of Equation (8.2) in the context of episodic reinforcement 
learning. We describe this application in a bit more detail next. 
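
As a minimal numerical sketch of situation (i), consider a binary causal covariate Z
and a binary target Y. All distributions below are hypothetical and chosen only for
illustration; the mechanism p(y | z) is the same before and after the intervention
on Z, which is what makes the reweighting valid.

```python
import random

random.seed(1)
n = 20000
p_z = {0: 0.8, 1: 0.2}          # observational distribution p(z)
p_tilde_z = {0: 0.2, 1: 0.8}    # distribution of Z after the intervention


def sample_y(z):
    # mechanism p(y | z); it is not changed by the intervention on Z
    return 1 if random.random() < 0.3 + 0.5 * z else 0


z = [1 if random.random() < p_z[1] else 0 for _ in range(n)]
y = [sample_y(zi) for zi in z]

# Weights are the ratio of the new and the old density of the intervened variable.
weights = [p_tilde_z[zi] / p_z[zi] for zi in z]
xi_hat = sum(yi * wi for yi, wi in zip(y, weights)) / n      # estimator (8.3)
naive = sum(y) / n                                           # unweighted sample mean
truth = sum(p_tilde_z[v] * (0.3 + 0.5 * v) for v in (0, 1))  # expectation under the intervention
```

The unweighted mean estimates the expectation of Y under the observational
distribution, whereas the reweighted estimate targets the expectation under the
intervention.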


8.2.2 Episodic Reinforcement Learning 


Reinforcement learning [e.g., Sutton and Barto, 2015] models the behavior of
agents taking actions in a world. Depending on the current state $S_t$ of the world
and the action $A_t$, the state of the world changes according to a Markov decision
process, for example [e.g., Bellman, 1957]; that is, the probability $P(S_{t+1} = s)$ of
entering a new state s depends only on the current state $S_t$ and action $A_t$. Furthermore,
the agent will receive some reward $R_t$ that depends on $S_t$, $A_t$, and $S_{t+1}$; the
sum over all rewards is sometimes called the return, which we write as $Y := \sum_t R_t$.
The way the return Y depends on states and actions is unknown to the agent, who
tries to improve his strategy $(a,s) \mapsto \pi(a \mid s) := P(A_t = a \mid S_t = s)$, that is, the
conditional of the action he chooses depending on the observed part of the state
of the world. In episodic reinforcement learning, the state is reset after a finite
number of actions (see Figure 8.3). In Section 8.2.3, we consider the example of
blackjack. In the example of Figure 8.3, the player makes K = 3 decisions, after
which the cards are reshuffled. Then, a new episode starts.

Figure 8.3: The graph describes an episodic reinforcement learning problem. The action
variables $A_j$ influence the system’s next state $S_{j+1}$. The variable Y describes the output
or return that we receive after one episode. This return Y may depend on the actions,
too (edges omitted for clarity); it is often modelled as the (possibly weighted) sum of
rewards that are received after each decision; see Section 8.2.3. The whole system can be
confounded by an unobserved variable H. The bold, red edges indicate the conditionals
that the player can influence, that is, the strategy. Equation (8.4) estimates the expected
outcome E[Y] under a strategy $\tilde{\pi}$ from data obtained using strategy $\pi$. The equation still
holds when there are additional edges from the actions A to H and/or Y.

Suppose that we play n games under a certain strategy $(a,s) \mapsto \pi(a \mid s)$, and each
game is an episode. This function $\pi$ does not depend on the number of “moves”
we have played so far but just on the value of the state. As long as this strategy
assigns a positive probability to any action, Equation (8.2) allows us to estimate
the performance of a different strategy $(a,s) \mapsto \tilde{\pi}(a \mid s)$:

$$\hat{\xi}_{n,\mathrm{ERL}} := \frac{1}{n} \sum_{i=1}^{n} Y^i\, \frac{\prod_{j=1}^{K} \tilde{\pi}(A^i_j \mid S^i_j)}{\prod_{j=1}^{K} \pi(A^i_j \mid S^i_j)}. \tag{8.4}$$

This can be seen as a Monte Carlo method for off-policy evaluation [Sutton and
Barto, 2015, Chapter 5.5]. In practice, the estimator (8.4) often has large variance;
in continuous settings the variance may even be infinite. It has been suggested to
reweight [Sutton and Barto, 2015] or to disregard the (five) largest weights [Bottou
et al., 2013] to trade off variance for bias. Bottou et al. [2013] additionally compute
confidence intervals and gradients in the case of parametrized densities. The latter
are important if one wants to search for optimal strategies.

We now briefly discuss two examples in which exploiting the causal structure
leads to an improved statistical performance of the learning procedure. We regard
them as interesting examples that shed some light on the relationship between
reinforcement learning and causality.
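
Before turning to these examples, the estimator (8.4) can be illustrated with a toy
simulation. Everything about the toy world is an assumption made for this sketch:
there are K = 2 decisions per episode, a fresh binary state is drawn at every step,
and the reward is 1 whenever the action equals the state. Episodes are generated
with a uniform strategy and reweighted to evaluate a different one.

```python
import random

random.seed(2)
K = 2        # decisions per episode
n = 20000    # number of episodes


def pi(a, s):
    # strategy used to collect the data: uniform over the two actions
    return 0.5


def pi_tilde(a, s):
    # strategy to be evaluated: match the state with probability 0.9
    return 0.9 if a == s else 0.1


total = 0.0
for episode in range(n):
    weight, ret = 1.0, 0
    for step in range(K):
        s = random.randint(0, 1)                        # toy world: fresh state each step
        a = 1 if random.random() < pi(1, s) else 0      # act according to pi
        ret += int(a == s)                              # reward 1 if the action matches the state
        weight *= pi_tilde(a, s) / pi(a, s)             # product of per-step ratios
    total += ret * weight

xi_hat = total / n       # estimator (8.4)
truth = K * 0.9          # expected return under pi_tilde in this toy world
```

Since the weight is a product over the K steps, its variance grows quickly with
the episode length, which is one way to see why the variance reduction techniques
mentioned above matter in practice.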


8.2.3 State Simplification in Blackjack 


The methodology proposed in Section 8.2.2 can be used to learn how to play black- 
jack (a card game). We pretend that a player enters a casino and starts playing 
blackjack knowing neither the objective of the game nor the optimal strategy; in- 
stead, he applies a random strategy. At each point in the game, the player is asked 
which of the legal actions he wants to take, and after the game has finished the 
dealer reveals how much money the player won or lost. After a while the player 
may update his strategy toward decisions that proved to be successful and continue 
playing. From a mathematical point of view, blackjack is solved. The optimal 
strategy (for infinitely many decks) was discovered by Baldwin et al. [1956] and
leads to an expectation of E[Y] ≈ −0.006 € for a player betting 1 €.

Figure 8.4: Here, there exist variables $F_1,\dots,F_4$ that contain all relevant information about
the states $S_1,\dots,S_4$ in the sense that Equations (8.5) and (8.6) hold. Equation (8.6) is not
represented in the graph. Then, it suffices if the actions $A_j$ depend on $F_{j-1}$ (red, solid lines)
rather than $S_{j-1}$ (red, dashed lines). In the blackjack example, the $S_j$ encode the dealer’s
hand and player’s hand including suits, while the $F_j$ encode the same information except
for suits (suits do not have an influence on the outcome of blackjack). Since the $F_j$ take fewer
values than the $S_j$, the optimal strategy becomes easier to learn.

How does causality come into play? We have assumed that the player is unaware
of the precise rules of blackjack; maybe he knows, however, that the win or loss
is determined only by the values of the cards and not their suits; that is, the rules
do not distinguish between a queen of clubs and a queen of hearts. The player can
then immediately conclude that the optimal strategy does not depend on the suit.
This comes with an obvious advantage when searching for the optimal strategy:
the number of relevant states and therefore the space of possible strategies
is reduced significantly. Figure 8.4 depicts this argument: the variables $S_j$ contain all
information, whereas the variables $F_j$ do not contain suits. For example,


$S_3$ = (Player: ♥K, ♣5, ♠4; Dealer: ♦K)
$F_3$ = (Player: K, 5, 4; Dealer: K).


Since the final result Y depends only on $(F_1,\dots,F_4)$ and not on the “full state”
$(S_1,\dots,S_4)$, the actions may be chosen to depend on the F variables. Similarly,
one may exploit that the order of the cards does not matter either. More formally,
we have the following result:


Proposition 8.1 (State simplification) Suppose that we are interested in the return
$Y := \sum_j R_j$, and all variables are discrete. Assume that there is a function f
such that for all j and for $F_j := f(S_j)$, we have

$$R_j \perp\!\!\!\perp S_j \mid F_j, A_j, \tag{8.5}$$

and the full states do not matter for the change of states in the following sense: for
all $s_j$ and for all $s_{j-1}, s^{\circ}_{j-1}$ with $f(s_{j-1}) = f(s^{\circ}_{j-1})$,

$$P\big(f(s_j) \mid s_{j-1}\big) = P\big(f(s_j) \mid s^{\circ}_{j-1}\big). \tag{8.6}$$

Then the optimal strategy $(a,s) \mapsto \pi_{opt}(a \mid s)$ depends only on $F_j$ and not on $S_j$:
there exists

$$\pi_{opt} \in \operatorname*{argmax}_{\pi} \mathbb{E}[Y]$$

such that

$$\pi_{opt}(a_j \mid s_{j-1}) = \pi_{opt}(a_j \mid s^{\circ}_{j-1}) \quad \forall\, s_{j-1}, s^{\circ}_{j-1} : f(s_{j-1}) = f(s^{\circ}_{j-1}).$$


This result is particularly helpful if $F_j$ takes fewer values than $S_j$. The proof is
provided in Appendix C.11. In the blackjack example, Equation (8.6) states that
the probability of drawing another king depends only on the values of the cards
drawn before (the number of kings in particular), not their suits.
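
To get a feeling for the size of the reduction, the following sketch counts states
for a toy version of the problem. It considers only a player's hand of three cards
drawn from a single deck, which is an assumption of this example, and it uses a
simplification map (called `simplify` here) that drops both suits and card order.

```python
from itertools import permutations

RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = [(rank, suit) for rank in RANKS for suit in SUITS]


def simplify(hand):
    """The map f: forget suits and the order in which the cards were drawn."""
    return tuple(sorted(rank for rank, _ in hand))


# Full states: ordered hands of three distinct cards, suits included.
full_states = set(permutations(DECK, 3))
# Simplified states: the multiset of ranks only.
simple_states = {simplify(hand) for hand in full_states}

print(len(full_states), len(simple_states))  # prints "132600 455"
```

A strategy defined on the simplified states has to be learned for a few hundred
situations instead of more than a hundred thousand.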


8.2.4 Improved Weighting in Advertisement Placement 


A related argument is used by Bottou et al. [2013] for the optimal placement of 
advertisements. Consider the following simplified description of the system. A 
company, which we will refer to as the publisher, runs a search engine and may 
want to display advertisements in the space above the search results, the main- 
line. Only if a user clicks on an ad does the publisher receive money from the 
corresponding company. Before displaying the ads, the publisher sets the mainline 
reserve A, a real-valued parameter that determines how many ads are shown in the 
mainline. In most systems, the number of mainline ads F varies between 0 and 4, 
that is, F € {0,1,2,3,4}. The mainline reserve A usually depends on many vari- 
ables (e.g., search query, date and time of the query, location), that we call the state 
S. If the search query indicates that the user intends to buy new shoes, for example, 
one may want to show more ads compared to when a user is looking for the time 
of the next service at church. We can model the system as episodic reinforcement 
learning with episodes of length 1.² The return Y equals the number of clicks per
episode; its value is either 0 or 1. The question of how to choose an optimal mainline
reserve A then corresponds to finding the optimal strategy $(a,s) \mapsto \pi_{opt}(a \mid s)$.

² In reality, the systems are usually more complicated. For example, in an auction-like
procedure, the advertisers place bids on certain search queries, which then influence the
price for a click.

Figure 8.5: Example for the placement of advertisements. The target variable Y indicates
whether a user has clicked on one of the shown ads. H (unknown) and S (known) are state
variables and the action A corresponds to the mainline reserve, a real-valued parameter that
determines how many ads are shown in the mainline. F is a discrete variable indicating
the (known) number of ads placed in the mainline. Although the conditional $p(a \mid s)$ is
randomized over, we may use $p(f \mid s)$ for the reweighting (see Proposition 8.2).

Figure 8.5 shows a picture of the simplified problem. The state S contains information
about the user that is available to the publisher. The hidden variable H contains
unknown user information (e.g., his intention), the action A is the mainline reserve,
and Y is the event whether or not a person clicks on one of the ads. Finally, F is
the discrete variable that says how many ads are shown. Evaluating new strategies
$(a,s) \mapsto \tilde{p}(a \mid s)$ corresponds to applying Equation (8.4):

$$\hat{\xi}_{n,\mathrm{ERL}} = \frac{1}{n} \sum_{i=1}^{n} Y^i\, \frac{\tilde{p}(A^i \mid S^i)}{p(A^i \mid S^i)}.$$


(Here, we write $p(a \mid s)$ rather than $\pi(a \mid s)$ for notational convenience.) We can
now benefit from the following key insight. Whether a person clicks on an ad
depends on the mainline reserve A but only via the value of F. The user never
sees the real-valued parameter A. This is a somewhat trivial observation when we
think about the causal structure of the system (see Figure 8.5). Exploiting this fact,
however, we can use a different estimator,

$$\frac{1}{n} \sum_{i=1}^{n} Y^i\, \frac{\tilde{p}(F^i \mid S^i)}{p(F^i \mid S^i)};$$

see Proposition 8.2. And since F is a discrete variable taking values between 0
and 4, say, this usually leads to weights that are much better behaved. In practice,
the modification may reduce the size of confidence intervals considerably [Bottou
et al., 2013, Section 5.1]. As in Section 8.1, we can exploit our knowledge of the
causal structure to improve statistical performance. More formally, the procedure
is justified by the following proposition:

Method | Training data from | Test domain
Domain generalization | $(X^1,Y^1),\dots,(X^D,Y^D)$ | $T := D+1$
Multi-task learning | $(X^1,Y^1),\dots,(X^D,Y^D)$ | $T \in \{1,\dots,D\}$
Asymmetric multi-task learning | $(X^1,Y^1),\dots,(X^D,Y^D)$ | $T := D$


Table 8.1: In domain generalization, the test data come from an unseen domain, whereas
in multi-task learning, some data in the test domain(s) are available.


Proposition 8.2 (Improved weighting) Suppose there is a density p over X =
(A, F, H, S, Y) that is entailed by an SCM $\mathfrak{C}$ with graph shown in Figure 8.5. Assume
further that the density $\tilde{p}$ is entailed by an SCM $\tilde{\mathfrak{C}}$ that corresponds to an
intervention in A of the form $do(A := \tilde{f}(S, \tilde{N}_A))$ and satisfies $\tilde{p}(f \mid s) = 0$ if $p(f \mid s) = 0$ and
$\tilde{p}(a \mid s) = 0$ if $p(a \mid s) = 0$. We then have

$$\tilde{\mathbb{E}}\,Y = \mathbb{E}\Big[Y\, \frac{\tilde{p}(A \mid S)}{p(A \mid S)}\Big] = \mathbb{E}\Big[Y\, \frac{\tilde{p}(F \mid S)}{p(F \mid S)}\Big].$$

The proof can be found in Appendix C.12. In general, the condition of the
non-vanishing densities is indeed necessary: if there is a set of a and s values (with
non-vanishing Lebesgue measure) that belong to the support of $\tilde{p}$ and contribute to
the expectation of Y, there must be a non-vanishing probability under p to sample
data in this area.
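
The two weightings in Proposition 8.2 can be compared in a toy simulation. All
modelling choices below are assumptions made for this sketch and are not taken from
Bottou et al. [2013]: the mainline reserve A is Gaussian given a binary state S, the
number of ads F is obtained by thresholding A, there is no hidden variable H, and the
new strategy shifts the mean of A by a constant.

```python
import math
import random

random.seed(3)
n = 20000
CUTS = [0.0, 0.5, 1.0, 1.5]   # hypothetical thresholds that turn A into F in {0,...,4}
SHIFT = 1.0                   # new strategy: the mean of A given S is shifted by SHIFT


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def num_ads(a):
    """F is a deterministic function of A."""
    return sum(a > c for c in CUTS)


def prob_f(f, mean):
    """P(F = f) when A is Gaussian with the given mean and unit variance."""
    edges = [-math.inf] + CUTS + [math.inf]
    return normal_cdf(edges[f + 1] - mean) - normal_cdf(edges[f] - mean)


def click_prob(f, s):
    # Y depends on A only through F (and on S)
    return 0.05 + 0.05 * f + 0.1 * s


terms_a, terms_f = [], []
for _ in range(n):
    s = random.randint(0, 1)
    a = random.gauss(s, 1.0)                               # old strategy p(a | s)
    f = num_ads(a)
    y = 1 if random.random() < click_prob(f, s) else 0
    w_a = math.exp(SHIFT * (a - s) - 0.5 * SHIFT ** 2)     # ratio of the two Gaussian densities
    w_f = prob_f(f, s + SHIFT) / prob_f(f, s)              # ratio of the discrete conditionals
    terms_a.append(y * w_a)
    terms_f.append(y * w_f)

est_a = sum(terms_a) / n      # reweighting with A
est_f = sum(terms_f) / n      # reweighting with F
truth = sum(0.5 * prob_f(f, s + SHIFT) * click_prob(f, s) for s in (0, 1) for f in range(5))
```

Both estimates target the same quantity. The weights based on F take only a few
bounded values, whereas the weights based on A are unbounded, so one would expect
the terms in `terms_f` to vary less than those in `terms_a`.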


8.3 Domain Adaptation 


Domain adaptation is another machine learning problem that is naturally related to
causality [Schölkopf et al., 2012]. Here, we will relate domain adaptation to what
we called invariant prediction in “Different Environments” in Section 7.2.5. We do
not claim that this connection, in its current form, yields major improvements, but 
we believe that it could prove to be useful for developing a novel methodology in 
domain adaptation. 

Let us assume that we obtain data from a target variable Y° and d possible pre- 
dictors X° = (Xf,...,X‘) in different domains e € € = {1,...,D} and that we are 
interested in predicting Y. Adapting to widely used notation, we use the terms 
“domain” or “task.” Table 8.1 describes a taxonomy of three problems in domain 
adaptation that we consider here. 

Our main assumption is that there exists a set S* ⊆ {1, ..., d} such that the
conditional Y^e | X^e_{S*} is the same for all domains e ∈ E, including the test
domain, that is, for all e, f ∈ E and for all x_{S*},


Y^e | X^e_{S*} = x_{S*} and Y^f | X^f_{S*} = x_{S*} have the same distribution. (8.7)

In Sections 7.1.6 and 7.2.5 we have considered a similar setup, where we used the
term “environments” rather than “domains” and called the property (8.7) “invariant 
prediction.” We have argued that if there is an underlying SCM and if the environ- 
ments correspond to interventions on nodes other than the target Y, property (8.7) 
is satisfied for S* = PA_Y (cf. also our discussion of Simon’s invariance criterion
in Section 2.2). Property (8.7) may also hold, however, for sets other than the 
causal parents. Since our goal is prediction, we are most interested in sets S* that 
satisfy (8.7) and additionally predict Y as accurately as possible. Let us for now
assume that we are given such a set S* (we will return to this issue later) and point
out how the assumption (8.7) relates to domain adaptation.

In settings of covariate shift [e.g., Shimodaira, 2000, Quiñonero-Candela et al.,
2009, Ben-David et al., 2010], one usually assumes that the conditional Y^e | X^e = x
remains invariant over all tasks e. Assumption (8.7) means that covariate shift 
holds for some subset S* of the variables and thus constitutes a generalization of 
the covariate shift assumption. 

For domain generalization, and if the set S* is known, we can then apply tradi- 
tional methods for covariate shift for this subset S*. For example, if the supports 
of the data in input space are overlapping (or the system is linear), we may use the 
estimator f_{S*}(X^T_{S*}) with f_{S*}(x) := E[Y^1 | X^1_{S*} = x] in test domain T. One can prove
that this approach is optimal in an adversarial setting, where the distributions in 
the test domain may be arbitrarily different from the training domains, except for 
the conditional distribution (8.7) that we require to remain invariant [Rojas-Carulla 
et al., 2016, Theorem 1]. In multi-task learning, it is less obvious how to exploit 
the knowledge of such a set S*. In practice, one needs to combine information 
gained from pooling the tasks and regressing Y on S* with knowledge obtained 
from considering the test task separately [Rojas-Carulla et al., 2016]. 
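
As a minimal sketch of the domain generalization case, the following simulation pools two training domains, regresses Y on the invariant set S*, and compares the result with a regression on all predictors in an unseen test domain. The linear model X1 → Y → X2, all coefficients, and the helper names (simulate, fit_least_squares, mean_squared_error) are illustrative assumptions and are not taken from Rojas-Carulla et al. [2016]. Here S* = {1}: the mechanism for Y given X1 is the same in every domain, whereas the mechanism of the child X2 changes.

```python
import random


def simulate(n, x1_scale, child_coef, rng):
    """Draw n samples (x1, x2, y) from the assumed SCM X1 -> Y -> X2 in one domain."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, x1_scale)               # input distribution differs per domain
        y = 1.5 * x1 + rng.gauss(0.0, 1.0)          # invariant mechanism for Y given X1
        x2 = child_coef * y + rng.gauss(0.0, 0.5)   # mechanism of the child X2 differs per domain
        rows.append((x1, x2, y))
    return rows


def fit_least_squares(rows, use_x2):
    """Least squares without intercept (all variables have mean zero by construction)."""
    s11 = sum(x1 * x1 for x1, _, _ in rows)
    s1y = sum(x1 * y for x1, _, y in rows)
    if not use_x2:
        return (s1y / s11, 0.0)
    s22 = sum(x2 * x2 for _, x2, _ in rows)
    s12 = sum(x1 * x2 for x1, x2, _ in rows)
    s2y = sum(x2 * y for _, x2, y in rows)
    det = s11 * s22 - s12 * s12
    # Closed-form solution of the 2x2 normal equations.
    return ((s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det)


def mean_squared_error(rows, coef):
    return sum((y - coef[0] * x1 - coef[1] * x2) ** 2 for x1, x2, y in rows) / len(rows)


rng = random.Random(0)
train = simulate(2000, 1.0, 1.0, rng) + simulate(2000, 0.5, 1.2, rng)  # pooled training domains
test = simulate(5000, 2.0, -1.0, rng)                                    # unseen test domain

coef_invariant = fit_least_squares(train, use_x2=False)  # regress Y on S* = {X1}
coef_all = fit_least_squares(train, use_x2=True)         # regress Y on (X1, X2)

mse_invariant = mean_squared_error(test, coef_invariant)
mse_all = mean_squared_error(test, coef_all)
print(f"test MSE using S*: {mse_invariant:.2f}, using all predictors: {mse_all:.2f}")
```

In this toy setting the regression on all predictors fits the training domains better but relies on the unstable relation between Y and X2, so its test error should be much larger than that of the regression on S*.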

If the set S* is unknown, we again propose to search for sets S that satisfy (8.7) 
over available domains. When learning the causal predictors, one prefers to stay 
conservative, and the method of invariant causal prediction [Peters et al., 2016] 
therefore outputs the intersection of all sets S satisfying (8.7); see Equation (7.5). 
Here, we are interested in prediction instead. Among all sets that lead to invariant 
prediction, one may therefore choose the set S that leads to the best predictive 
performance, which is usually one of the larger of those sets. The same applies if 
there are different known sets S that all satisfy (8.7). If the data are generated by 
an SCM and the domains correspond to different interventions, the set S with the 
best predictive power that satisfies (8.7) can, in the limit of infinite data, be shown 
to be a subset of the Markov blanket of Y (see Problem 8.5).


8.4 Problems


Problem 8.3 (Half-sibling regression) Consider the DAG in Figure 8.1. The fact 
that X provides additional information about Q on top of the one provided by Y 
follows from causal faithfulness. Why? 


Problem 8.4 (Inverse probability weighting) Consider an SCM C of the form

Z := N_Z
Y := Z^2 + N_Y,

with N_Y, N_Z iid ∼ N(0, 1) and an intervened version C̃ with

do(Z := Ñ_Z),

where Ñ_Z ∼ N(2, 1).


a) (optional) Compute E[Y] := E_{P^C}[Y] and Ẽ[Y] := E_{P^{C̃}}[Y].


b) Draw n = 200 i.i.d. data points from the SCM C and implement the
estimator (8.3) for estimating Ẽ[Y].


c) Compute the estimate in b) and the empirical variance of the weights ap- 
pearing in (8.3) for increasing sample size n between n = 5 and n = 50,000. 
What do you conclude? 
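
One possible starting point for parts b) and c) is sketched below. It assumes that the estimator (8.3), which is not restated in this section, is the usual weighted average (1/n) Σ_i y_i · p̃(z_i)/p(z_i) over a sample drawn from C; the function names normal_pdf and weighted_estimate are hypothetical.

```python
import math
import random


def normal_pdf(z, mean):
    """Density of N(mean, 1) evaluated at z."""
    return math.exp(-0.5 * (z - mean) ** 2) / math.sqrt(2.0 * math.pi)


def weighted_estimate(n, rng):
    """Return the weighted estimate of E~[Y] and the empirical variance of the weights."""
    weights, weighted_sum = [], 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)                       # Z := N_Z (observational SCM)
        y = z ** 2 + rng.gauss(0.0, 1.0)              # Y := Z^2 + N_Y
        w = normal_pdf(z, 2.0) / normal_pdf(z, 0.0)   # weight p~(z) / p(z)
        weights.append(w)
        weighted_sum += y * w
    mean_w = sum(weights) / n
    var_w = sum((w - mean_w) ** 2 for w in weights) / n
    return weighted_sum / n, var_w


rng = random.Random(0)
results = {}
for n in (5, 50, 200, 5000, 50000):
    results[n] = weighted_estimate(n, rng)
    print(n, results[n])
```

For reference, under the intervened model Ẽ[Y] = E[Ñ_Z^2] = 1 + 2^2 = 5, so the printed estimates can be compared against this value as n grows.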


Problem 8.5 (Invariant predictors) We want to justify the last sentence in
Section 8.3. Consider a DAG over variables Y, E, and X_1, ..., X_d, in which E (for
“environment”) is not a parent of Y and does not have any parents itself. Denote
the Markov blanket of Y by M. Prove that for any set S ⊆ {X_1, ..., X_d} with

Y ⊥⊥ E | S

there is another set S_new ⊆ M such that

Y ⊥⊥ E | S_new and Y ⊥⊥ (S \ S_new) | S_new.


Hidden Variables


So far, we assumed that all variables from the model have been measured (except 
for the noises). Since in practice, we are choosing the set of random variables 
ourselves, we need to define a concept of “causally relevant” variables. In Sec- 
tion 9.1 we therefore introduce the terms “causal sufficiency” and “interventional 
sufficiency.” But even if we leave aside the details of the precise definition, it is 
apparent that in most practical applications many causally relevant variables will 
be unobserved. Simpson’s paradox (Section 9.2) describes how ignoring hidden 
confounding can lead to wrong causal conclusions. In linear settings, a structure 
that is often referred to as an instrumental variable can make the regression co- 
efficient, which corresponds to the causal effect (see Example 6.42), identifiable 
(Section 9.3). It is an active field of research to find good graphical representations 
for SCMs with hidden variables, in particular those that encode the conditional in- 
dependence structure; we will present some of the solutions in Section 9.4. Finally, 
hidden variables lead to constraints appearing in the observed distribution that go 
beyond conditional independences (Section 9.5). We briefly discuss how these con- 
straints could be used for structure learning but do not provide any methodological 
details. For more historical notes on the treatment of hidden variables, we refer to 
Spirtes et al. [2000, Section 6.1]. 


9.1 Interventional Sufficiency


A set of variables X is usually said to be causally sufficient if there is no hidden
common cause C ∉ X that is causing more than one variable in X [e.g., Spirtes,
2010]. While this definition matches the intuitive meaning of the set of “relevant”
variables, it uses the concept of a “common cause” and should therefore be
understood relative to a larger set of variables X̃ ⊇ X (for which, again, we might want
to define causal sufficiency). In the structural causal model corresponding to this
larger set X̃, a variable C is a common cause of X and Y if there is a directed path
from C to X and Y that does not include Y and X, respectively. Common causes
are also called confounders and we use these terms interchangeably.

We propose a small modification of causal sufficiency that we call interventional 
sufficiency, a concept that is based on falsifiability of SCMs; see Section 6.8. 


Definition 9.1 (Interventional sufficiency) We call a set X of variables inter- 
ventionally sufficient if there exists an SCM over X that cannot be falsified as an 
interventional model; that is, it induces observational and intervention distribu- 
tions that coincide with what we observe in practice. 


We believe that this concept is intuitively appealing since it describes when a set 
of variables is large enough to perform causal reasoning, in the sense of computing 
observational and intervention distributions. 

It should be intuitive that considering two variables is usually not sufficient if 
there exists a latent common cause. The two variables are causally insufficient by 
definition, and Simpson’s paradox in Section 9.2 (see also Example 6.37) shows 
that in general these two variables are not interventionally sufficient either. In fact, 
the paradox drives the statement to an extreme: an SCM over the two observed 
variables that ignores confounding does not only entail the wrong intervention dis- 
tributions, it can even reverse the sign of the causal effect: a treatment can look 
beneficial although it is harmful; see (9.2). 

Sometimes, however, we can still compute the correct intervention distributions 
even in the presence of latent confounding. The set of variables in the following 
example is interventionally sufficient but causally insufficient. 


Example 9.2 Consider the following SCM

Z := N_Z
X := 1_{Z≥2} + N_X
Y := Z mod 2 + X + N_Y

with N_Z ∼ U({0,1,2,3}) being uniformly distributed over {0,1,2,3} and N_X, N_Y iid ∼
N(0,1); see Figure 9.1 (left). While variables X and Y are clearly causally
insufficient,¹ one can show that the two variables X and Y are interventionally sufficient.
The reason is that the “confounder” Z consists of two independent parts: Z_1 := 1_{Z≥2}
is the first bit of the binary representation of Z, and Z_2 := Z mod 2 is the second bit.
In this sense, we can separate the “confounder” into the independent variables Z_1
and Z_2, with Z_1 influencing X, and Z_2 influencing Y; see Figure 9.1.


Figure 9.1: Both graphs represent interventionally equivalent SCMs for the model
described in Example 9.2. While only the second representation renders X and Y causally
sufficient, X and Y are interventionally sufficient independently of the representation.
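
The argument in Example 9.2 can be checked numerically. The sketch below first verifies exactly that the two bits of Z are independent and then simulates the full SCM. If X and Y are interventionally sufficient, an SCM over X and Y alone of the form X := Ñ_X, Y := X + Ñ_Y (with Ñ_X = Z_1 + N_X and Ñ_Y = Z_2 + N_Y independent) should reproduce the behaviour of the full model: the observational regression slope of Y on X should be close to 1, and the mean of Y under do(X := 3) should be close to 3.5. The function name sample and the intervention value 3 are our own illustrative choices.

```python
import random
from fractions import Fraction
from itertools import product

# Exact check that the two bits of Z ~ U({0, 1, 2, 3}) are independent.
joint = {}
for z in range(4):
    bits = (int(z >= 2), z % 2)  # (Z_1, Z_2)
    joint[bits] = joint.get(bits, Fraction(0)) + Fraction(1, 4)
p_z1 = {a: sum(p for (b1, _), p in joint.items() if b1 == a) for a in (0, 1)}
p_z2 = {b: sum(p for (_, b2), p in joint.items() if b2 == b) for b in (0, 1)}
bits_independent = all(
    joint.get((a, b), Fraction(0)) == p_z1[a] * p_z2[b]
    for a, b in product((0, 1), repeat=2)
)


def sample(n, rng, do_x=None):
    """Sample (X, Y) from the SCM of Example 9.2, optionally under do(X := do_x)."""
    xs, ys = [], []
    for _ in range(n):
        z = rng.randrange(4)
        x = do_x if do_x is not None else int(z >= 2) + rng.gauss(0.0, 1.0)
        y = z % 2 + x + rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)
    return xs, ys


rng = random.Random(0)
xs, ys = sample(20000, rng)
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))

_, ys_do = sample(20000, rng, do_x=3.0)
mean_y_do = sum(ys_do) / len(ys_do)
print(bits_independent, round(slope, 2), round(mean_y_do, 2))
```

A slope near 1 indicates that the hidden Z does not bias the regression of Y on X in this particular SCM, which is what distinguishes it from a generic confounded model with the same graph.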


In general, we have the following relationship between causal and interventional 
sufficiency (see Appendix C.13 for a proof): 


Proposition 9.3 (Interventional sufficiency and causal sufficiency) Let C be an
SCM for the variables X that cannot be falsified as an interventional model.


(i) If a subset O ⊆ X is causally sufficient, then it is interventionally sufficient.


(ii) In general, the converse is false; that is, there are examples of
interventionally sufficient sets O ⊆ X that are not causally sufficient.


Furthermore, Example 9.2 shows that there cannot be a solely graphical criterion 
for determining whether a subset of the variables is interventionally sufficient.
For many SCMs with a structure similar to Figure 9.1 (left), X and Y are inter- 
ventionally insufficient. However, the following remark shows that omitting an 
“intermediate” variable preserves interventional sufficiency. 


Remark 9.4 We have the following three statements. 


(i) Assume that there is an SCM over X, Y, Z with graph X → Y → Z and X ⊥̸⊥ Z
that induces the correct interventions. Then X and Z are interventionally
sufficient due to the SCM over X, Z satisfying X → Z.


(ii) Assume that there is an SCM C over X, Y, Z that induces the correct
interventions with graph X → Y → Z and additional X → Z and assume further that
P^C_{X,Y,Z} is faithful with respect to this graph; see also (iii). Then, again, X and
Z are interventionally sufficient due to the SCM over X, Z satisfying X → Z.


(iii) If the situation is the same as in (ii), except that

P^C_{Z|X=x} = P^{C;do(X:=x)}_Z = P^C_Z

for all x (in particular, P^C_{X,Y,Z} is not faithful with respect to the graph), then
X and Z are interventionally sufficient due to the SCM over X, Z with the
empty graph. Note that the counterfactuals may not be represented correctly.


The proof of these statements is left to the reader (see Problem 9.10).


¹ Here, the hidden common cause Z not only points into X and Y but also has a total
causal effect on both of them; see Definition 6.12.


Whenever we find an SCM over the observed variables that is interventionally 
equivalent to the original SCM over all variables, we may want to call the former 
one a marginalized SCM. We have seen that there is no solely graphical criterion for
determining the structure of a marginalized SCM. Instead, some information about
the causal mechanisms, that is, the specific form of the assignments, is needed.
Bongers et al. [2016] study marginalizations of SCMs in more detail. The key
idea is to start with the original SCM and to consider only the structural assign- 
ments of the observed variables. One then repeatedly plugs in the assignments of 
the hidden variables whenever they appear on the right-hand side. This yields an 
SCM with multivariate, possibly dependent noise variables. In some cases, it is 
then possible to choose an interventionally equivalent SCM with univariate noise 
variables. 


9.2 Simpson’s Paradox 


The kidney stone data set in Example 6.16 is well known for the following reason.
We have


P^C(R = 1 | T = A) < P^C(R = 1 | T = B) but
P^{C;do(T:=A)}(R = 1) > P^{C;do(T:=B)}(R = 1); (9.1)


see Example 6.37. Suppose that we have not measured the variable Z (size of the
stone) and furthermore that we do not even know about its existence. We might
then hypothesize that T → R is the correct graph. If we denote this (wrong) SCM
by C̃, we can rewrite (9.1) as


P^{C̃;do(T:=A)}(R = 1) < P^{C̃;do(T:=B)}(R = 1) but
P^{C;do(T:=A)}(R = 1) > P^{C;do(T:=B)}(R = 1). (9.2)


Due to the model misspecification, the causal statement gets reversed. Although 
A is the more effective drug, we propose to use B. But even if we knew about 
the common cause Z, is it possible that there is yet another confounding variable 
that we did not correct for? If we are unlucky, this is indeed the case and we 
have to reverse the conclusion once more if we include this variable. In principle, 
this could lead to an arbitrarily long sequence of reversed causal conclusions (see 
Problem 9.11). 
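
The reversal can be reproduced with a few lines of arithmetic. The counts below are the widely reproduced kidney stone figures attributed to Charig et al. (1986); we assume, without restating Example 6.16 here, that they match the data used there. Under the misspecified model with graph T → R, the intervention probabilities coincide with the conditionals P(R = 1 | T = t); under the model that includes the stone size Z, we use the standard adjustment Σ_z P(R = 1 | T = t, Z = z) P(Z = z).

```python
from fractions import Fraction

# (treatment, stone size) -> (number recovered, number treated); assumed counts, see above.
counts = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}
sizes = ("small", "large")
total = sum(n for _, n in counts.values())
p_size = {z: Fraction(sum(counts[(t, z)][1] for t in "AB"), total) for z in sizes}

observational, interventional = {}, {}
for t in "AB":
    recovered = sum(counts[(t, z)][0] for z in sizes)
    treated = sum(counts[(t, z)][1] for z in sizes)
    observational[t] = Fraction(recovered, treated)  # P(R = 1 | T = t)
    # Adjustment for Z: sum_z P(R = 1 | T = t, Z = z) P(Z = z)
    interventional[t] = sum(Fraction(*counts[(t, z)]) * p_size[z] for z in sizes)

for t in "AB":
    print(t, float(observational[t]), float(interventional[t]))
```

With these counts, treatment B looks better observationally (about 0.83 versus 0.78), whereas treatment A is better after adjusting for the stone size (about 0.83 versus 0.78), in line with (9.1) and (9.2).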

This example shows how careful we have to be when writing down the under- 
lying causal graph. In some situations, we can infer the DAG from the protocol 
describing the acquisition of the data. If the medical doctors assigning the treat- 
ments, for example, did not have any knowledge about the patient other than the 
size of the kidney stone, there cannot be any confounding factor other than the size 
of the stone. 

Summarizing, Simpson’s paradox is not so much a paradox but rather a
warning of how sensitive causal reasoning can be with respect to model misspec- 
ifications. Although we have phrased the example in a setting with confounding, 
it can also occur as a result of selection bias (Example 6.30) that has not been 
accounted for. 


9.3 Instrumental Variables 


Instrumental variables date back to the 1920s [Wright, 1928] and are widely used 
in practice [see, e.g., Imbens and Angrist, 1994, Bowden and Turkington, 1990, 
Didelez et al., 2010]. There exist numerous extensions and alternative methods; 
we focus on the essential idea. Consider a linear Gaussian SCM with the graph
shown in Figure 9.2 (left). Here, the coefficient α in the structural assignment


Y := αX + δH + N_Y


is the quantity of interest (see Equation (6.18) in Example 6.42); it is sometimes
called the average causal effect (ACE). It is not directly accessible, however,
because of the hidden common cause H. Simply regressing Y on X and taking the
regression coefficient generally results in a biased estimator for α:


cov[X, Y] / var[X] = (α var[X] + δγ var[H]) / var[X] = α + δγ var[H] / var[X],


where γ denotes the coefficient of H in the structural assignment for X (see below).


Instead, we may be able to exploit an instrumental variable — if it exists. For- 
mally, we call a variable Z in an SCM an instrumental variable for (X,Y) if (a) 
Z is independent of H, (b) Z is not independent of X (“relevance”), and (c) Z affects
Y only through X (“exclusion restriction”). For our purposes, it suffices to
consider the example graph shown in Figure 9.2 (left) that satisfies all of these 
assumptions. Note, however, that other structures do, too. For example, one can 
allow for a hidden common cause between Z and X. In practice, one usually uses 
domain knowledge to argue why conditions (a), (b), and (c) hold. 

In the linear case, we can exploit the existence of Z in the following way. Because
(H, N_X) is independent of Z, we can regard γH + N_X in


X := βZ + γH + N_X


as noise. It becomes apparent that we can therefore consistently estimate the
coefficient β and therefore have access to βZ (which, in the case of finitely many data,
is approximated by the fitted values β̂Z of this regression). Because of


Y := αX + δH + N_Y = α(βZ) + (αγ + δ)H + αN_X + N_Y,


where all terms other than α(βZ) are independent of Z, we can then consistently
estimate α by regressing Y on βZ. Summarizing, we first regress X on Z and then
regress Y on the predicted values β̂Z (predicted from the first regression). The
average causal effect α becomes identifiable in the limit of infinite data. This
method is commonly referred to as “two-stage least squares.” It makes use of linear
SCMs and the above-mentioned assumptions: (a) independence between H and Z,
(b) non-zero β (in the case of small or vanishing β, Z is called a “weak
instrument”), and (c) the absence of a direct influence from Z to Y.
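
The two regressions can be sketched in a short simulation. The coefficient values, the sample size, and the helper name slope are arbitrary choices made for illustration; with unit noise variances and β = γ = δ = 1, the naive regression of Y on X should be biased by δγ var[H]/var[X] = 1/3, whereas the two-stage estimate should be close to α.

```python
import random


def slope(x, y):
    """OLS slope of y on x (with intercept), that is, cov[x, y] / var[x]."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var


rng = random.Random(1)
alpha, beta, gamma, delta = 1.0, 1.0, 1.0, 1.0  # illustrative coefficients
n = 20000

Z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # instrument
H = [rng.gauss(0.0, 1.0) for _ in range(n)]  # hidden common cause
X = [beta * z + gamma * h + rng.gauss(0.0, 1.0) for z, h in zip(Z, H)]
Y = [alpha * x + delta * h + rng.gauss(0.0, 1.0) for x, h in zip(X, H)]

naive = slope(X, Y)                      # biased by the hidden H

beta_hat = slope(Z, X)                   # first stage: regress X on Z
X_fitted = [beta_hat * z for z in Z]     # fitted values (the intercept does not affect the slope)
two_stage = slope(X_fitted, Y)           # second stage: regress Y on the fitted values

print(f"naive: {naive:.3f}, two-stage: {two_stage:.3f}, true alpha: {alpha}")
```

Note that H is generated in the simulation but never used by either estimator, which mirrors the situation in which H is unobserved.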

Identifiability is not restricted to the linear setting, however. We now mention 
only four such results, even though there are many more [e.g., Hernán and Robins,
2006]. 


(i) It is not difficult to see that the method of two-stage least squares still works
if X depends on Z and H in a nonlinear but additive way; see Problem 9.12.


(ii) If the variables Z, X, and Y are binary, the ACE is defined as

P^{C;do(X:=1)}(Y = 1) − P^{C;do(X:=0)}(Y = 1).

Balke and Pearl [1997] provide (tight) lower and upper bounds for the ACE
without further assumptions on how Y depends on X and H, for example.
These bounds can be rather uninformative or they can collapse to a
single point. In the latter case, we call the ACE identifiable.


(iii) Wang and Tchetgen Tchetgen [2016] show that, still in the case of binary
treatment, the ACE becomes identifiable if the structural assignment for Y is
additive in X and H [Wang and Tchetgen Tchetgen, 2016, Theorem 1].


(iv) For identifiability in the continuous case, see Newey [2013] and references
therein.


Figure 9.2: Left: setting of an instrumental variable (Section 9.3). A famous example
is a randomized clinical trial with non-compliance: Z is the treatment assignment, X the
treatment and Y the outcome. Right: Y-structure; see Section 9.4.1.


Most concepts involving instrumental variables, such as the linear setting described
previously, extend to situations in which observed covariates W cause some (or all)
relevant variables. For example, in Figure 9.2 (left) we can allow for a variable W 
pointing at Z, X, and Y. The assumptions (a), (b), and (c), as well as the procedures, 
are then modified and always include conditioning on W. Brito and Pearl [2002b] 
extend the idea to multivariate Z and X (“generalized instrumental variables”). 


9.4 Conditional Independences and Graphical 
Representations 


In causal learning, we are trying to reconstruct the causal model from observational
data. We have seen several identifiability results that allow us to identify the graph
structure of an SCM over variables X from the observational distribution P_X. Let
us now turn to an SCM C over variables X = (O, H) that includes observed
variables O and hidden variables H. We may then still ask whether the graph structure
of C becomes identifiable from the distribution P_O over the observed variables, and
if so, how we can identify it.

In the case without hidden variables, we discussed in Section 7.2.1 how one 
can learn (parts of) the causal structure under the Markov condition and faith- 
fulness. These assumptions guarantee a one-to-one correspondence between d- 
separation and conditional independence, and we can therefore test for conditional 
independence in P_X and reconstruct properties of the underlying graph. Recall
that independence-based methods, in principle, search over the space of DAGs and 
output a graph (or an equivalence class of graphs) representing exactly the set of 
conditional independences found in the data. 

For causal learning with hidden variables, we would in principle like to search 
over the space of DAGs with latent variables. This comes with additional difficul- 
ties, however. We do not know the size of H and if we therefore do not restrict 
the number of hidden variables, there is an infinite number of graphical candidates 
that we have to search over. Furthermore, there is a statistical argument against this 
approach: the set of distributions that are Markovian and faithful with respect to 
a DAG forms a curved exponential family, which justifies the use of the BIC, for 
example [Haughton, 1988]; the set of distributions that are Markovian and faithful 
with respect to a DAG with latent variables, however, does not [Geiger and Meek, 
1998]. If searching over DAGs with latent variables is infeasible, can we instead 
represent each DAG with latent variables by a marginalized graph over the ob- 
served variables, possibly using more than one type of edge, and then search over 
those structures? We have seen in Section 9.1 that such an approach also comes 
with a difficulty: the marginalized graph should depend on the original underlying 
SCM, and it is not sufficient to consider the information contained in the original 
graph. As mentioned previously, Bongers et al. [2016] study marginalizations of
SCMs in more detail. 

For these reasons, we consider in the remainder of this section a slightly shifted 
problem: instead of checking whether a full distribution could have been induced 
by a certain DAG structure with latent variables, we restrict ourselves to certain
types of constraints. For example, we consider all distributions that satisfy the 
same set of conditional independence statements over the observed variables O 
(implicitly assuming the Markov condition and faithfulness). We then ask how we 
can represent this set of conditional independences. 

A straightforward solution would be to assume that the entailed distribution P_O
is Markovian and faithful with respect to a DAG without hidden variables, and,
similarly as before, then output a class of DAGs that represents the conditional
independences in the distribution of the observed variables. Representing the
conditional independence structure of P_O with a DAG has two well-known drawbacks:
(1) representing the set of conditional independences with a DAG over the
observed variables can lead to causal misinterpretations, and (2) the set of
distributions whose pattern of independences corresponds to the d-separation statements in
a DAG is not closed under marginalization [Richardson and Spirtes, 2002].


Figure 9.3: Starting with an SCM on the left-hand side, the three graphs on the right encode
the set of conditional independences (A ⊥⊥ C). Due to an erroneous causal interpretation,
the DAG is not desirable as an output of a causal learning method. In this example, the
IPG and the latent projection (ADMG) are equal to the MAG.


Figure 9.4: This example is taken from Richardson and Spirtes [2002, Figure 2(i)]. It
shows that DAGs are not closed under marginalization. There is no DAG over nodes
O = {A, B, C, D} that encodes all conditional independences from the graph including H.

For (1), consider an SCM that entails a distribution P_{A,B,C,H} that is Markovian and
faithful with respect to the corresponding DAG shown in Figure 9.3 (left). The only
(conditional) independence relation that can be found in the observed distribution
P_{A,B,C} is A ⊥⊥ C and therefore the DAG in Figure 9.3 (second from left) represents
this conditional independence perfectly; in this sense, it could be seen as the output
of PC. The causal interpretation, however, is erroneous. While in the original SCM
an intervention on C does not have any effect on B, the output of PC suggests
that there is a causal effect from C to B. Regarding (2), Figure 9.4 (taken from
Richardson and Spirtes [2002]) shows the structure of an SCM over variables
X = (O, H) whose distribution is Markovian and faithful with respect to a DAG G
(that is, G represents all conditional independences in P_X) and that satisfies
the following property: there is no DAG over O representing the conditional
independences that can be found in P_O. In this sense, DAGs are not closed under
marginalization.
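
To make point (1) concrete, the following sketch enumerates all d-separation statements among the observed variables of a DAG, using the standard criterion based on the moralized ancestral graph. Since Figure 9.3 is not reproduced here, the graph A → B ← H → C with hidden H is our reading of its left panel and should be treated as an assumption; the function names are hypothetical.

```python
from itertools import combinations


def ancestors(parents, nodes):
    """Return the given nodes together with all their ancestors."""
    result, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in result:
                result.add(p)
                stack.append(p)
    return result


def d_separated(parents, x, y, given):
    """Check d-separation of x and y given `given` via the moralized ancestral graph."""
    keep = ancestors(parents, {x, y} | set(given))
    adj = {v: set() for v in keep}
    for child in keep:
        pa = sorted(parents.get(child, ()))
        for p in pa:                       # undirected version of each edge
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(pa, 2):   # "marry" parents of a common child
            adj[p].add(q)
            adj[q].add(p)
    seen, stack = {x}, [x]
    while stack:                           # search for a path that avoids the conditioning set
        v = stack.pop()
        if v == y:
            return False
        for w in adj[v]:
            if w not in seen and w not in given:
                seen.add(w)
                stack.append(w)
    return True


def independences(parents, observed):
    """All statements (x, y, given) over the observed nodes that hold by d-separation."""
    found = set()
    for x, y in combinations(sorted(observed), 2):
        rest = [v for v in sorted(observed) if v not in (x, y)]
        for k in range(len(rest) + 1):
            for given in combinations(rest, k):
                if d_separated(parents, x, y, given):
                    found.add((x, y, given))
    return found


observed = {"A", "B", "C"}
with_hidden = {"B": {"A", "H"}, "C": {"H"}}   # assumed graph: A -> B <- H -> C, H hidden
dag_over_observed = {"B": {"A", "C"}}         # candidate DAG: A -> B <- C

print(independences(with_hidden, observed))
print(independences(dag_over_observed, observed))
```

Both graphs yield the single statement that A and C are d-separated by the empty set, which is why a purely independence-based method cannot distinguish them, although only the second graph contains an edge from C to B.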
The following subsection discusses some ideas that suggest graphs (over O) for
representing conditional independences. Note, however, that they do not necessar- 
ily come with an intuitive causal meaning. It may be difficult to infer properties 
of the structure of the underlying SCM over X = (O,H) from the graphical ob- 
jects. Graphical criteria for adjustment, as in Section 6.6, for example, need to be 
developed and proved for each type of graph again. 


9.4.1 Graphs 


Previously, we have used graphs to represent the structural relationships of SCMs; see
Definitions 3.1 and 6.2. The goal of this section is different: here, the aim is to
use graphs to represent constraints in the distribution induced by the SCM. In
Section 9.4, we mainly consider conditional independence relations; we discuss
other constraints in more detail in Section 9.5. We have seen that in the presence
of hidden variables, DAGs are a poor choice for representing conditional
independences. These shortcomings of DAGs initiated the development of new graphical
representations in causal inference. Richardson and Spirtes [2002] introduce
maximal ancestral graphs (MAGs), for example, and show that they form the smallest
superclass of DAGs that is closed under marginalization (see the preceding
discussion). These are mixed graphs and contain directed and bidirected edges.² MAGs
come with a slightly different separation criterion: instead of d-separation, one now
looks at m-separation [Richardson and Spirtes, 2002]. Then, for each DAG with
hidden variables there is a unique MAG over the observed variables that represents
the same set of conditional independences (by m-separation); a simple construction
protocol is provided in Richardson and Spirtes [2002, Section 4.2.1]; for an
example, see Figure 9.3. This mapping is not one-to-one: each MAG can be constructed
from infinitely many different DAGs (containing an arbitrary number of hidden
variables). As for DAGs, the Markov condition relates graphical separation statements
in a MAG with conditional independences. Different MAGs representing the same
set of m-separation statements are summarized within a Markov equivalence class
[Zhang, 2008b]; this equivalence class itself is often represented by a partial ancestral
graph (PAG); see Table 9.1 for an overview. In PAGs, edges can end with a
circle, which represents both possibilities of an arrow’s head and tail; see Figure 9.3.
Ali et al. [2009] provide graphical criteria that determine whether two MAGs are
Markov equivalent.


² In fact, they may even contain undirected edges and can therefore model selection bias. We refer
to Richardson and Spirtes [2002] for details.


Example 9.5 (Y-structure) Given that even a single MAG can represent an
arbitrary number of hidden variables, one may wonder whether a PAG,
constructed from a DAG with hidden variables, ever contains non-trivial causal
information. In Figure 9.3, for example, the PAG does not specify whether there is a
directed path between C and B or a hidden variable with directed paths into both
C and B. Figure 9.2 (right) shows the example of a Y-structure (Z_1, Z_2, and Y are
not directly connected). Consider now an SCM over an arbitrary number of
variables that contains four variables X, Z_1, Z_2, and Y over which it induces the same
conditional independences as the Y-structure does. We can then conclude that the
corresponding PAG contains a directed edge X → Y. In addition, the causal
relation between X and Y has to be unconfounded [e.g., Mani et al., 2006, Spirtes
et al., 2000, Figure 7.23]. Any SCM in which X and Y are confounded or in which
X is not an ancestor of Y leads to a different set of conditional independences.


We have mentioned that graphical objects such as MAGs are primarily
constructed to represent conditional independences and not to visualize SCMs (this
is how we have introduced graphs in Definition 3.1). Thus, causal semantics
becomes more complicated. In a MAG, for example, an edge A → B means that in
the underlying DAG (including the hidden variables), A is an ancestor of B and B
is not an ancestor of A; that is, the ancestral relationships are preserved. The PAG
in Figure 9.3, for example, should be interpreted as follows: “In the underlying
DAG, there could be a directed path from C to B, a hidden common cause, or a
combination of both.” As a consequence, causal reasoning in such graphs, that
is, computing intervention distributions, becomes more involved, too [e.g., Spirtes
et al., 2000, Zhang, 2008b]. Perkovic et al. [2015] characterize valid adjustment
sets (Section 6.6) that work not only for DAGs but also for MAGs.

As an alternative to MAGs and PAGs, one may consider induced path graphs
(IPGs) and (completed) partially oriented induced path graphs (POIPGs) that
can be used for representing sets of IPGs [Spirtes et al., 2000, Section 6.6]. These
graphs have initially been used to represent the output of the fast causal inference 
(FCI) algorithm; see Section 9.4.2. Consider a distribution that is Markovian and 
faithful with respect to a MAG. Since every MAG is an IPG but not vice versa, the 
Markov equivalence class of the MAG is contained in the Markov equivalence class 
of the corresponding IPG, and thus a PAG usually contains more causal information 
than a POIPG [Zhang, 2008b, Appendix A]. 

Yet another possibility is to start with the original DAG containing hidden
variables and then apply a latent projection [see Pearl, 2009, Verma and Pearl,
1991, Definition 2.6.1 and “embedded patterns”, respectively]. This operation
takes a graph G with observed and hidden variables and constructs a new
graphical object G̃ over the observed variables. The precise definition can be found in
Shpitser et al. [2014, Definition 4], for example. The resulting graph structure is
called an acyclic directed mixed graph (ADMG) and contains both directed and
bidirected edges. Again, m-separation leads to a Markov property [Richardson,
2003]. Instead of searching over DAGs with latent variables, we may now search
over ADMGs.
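
A latent projection can be sketched in a few lines. The code below follows our paraphrase of the usual definition and is not quoted from Shpitser et al. [2014]: it adds a directed edge a → b whenever there is a directed path from a to b whose intermediate nodes are all hidden, and a bidirected edge a ↔ b whenever some hidden node reaches both a and b along directed paths whose intermediate nodes are all hidden. The function name latent_projection and the example graph A → B ← H → C are illustrative assumptions.

```python
def latent_projection(children, hidden):
    """Project a DAG (dict: node -> set of children) onto its observed nodes.

    Returns (directed, bidirected), where directed contains pairs (a, b) for a -> b
    and bidirected contains sorted pairs (a, b) for a <-> b.
    """
    nodes = set(children) | {c for cs in children.values() for c in cs}
    observed = nodes - set(hidden)

    def observed_reach(start):
        """Observed nodes reachable from start via paths with hidden intermediate nodes only."""
        found, seen, stack = set(), {start}, [start]
        while stack:
            v = stack.pop()
            for c in children.get(v, ()):
                if c in hidden:
                    if c not in seen:      # continue through hidden nodes
                        seen.add(c)
                        stack.append(c)
                else:
                    found.add(c)           # stop at the first observed node on each path
        return found

    directed = {(a, b) for a in observed for b in observed_reach(a)}
    bidirected = set()
    for h in hidden:
        reach = sorted(observed_reach(h))
        bidirected |= {(a, b) for i, a in enumerate(reach) for b in reach[i + 1:]}
    return directed, bidirected


# Assumed example: A -> B <- H -> C with hidden H.
dag = {"A": {"B"}, "H": {"B", "C"}}
directed, bidirected = latent_projection(dag, hidden={"H"})
print(directed, bidirected)
```

For the assumed example graph, the projection consists of the directed edge A → B and the bidirected edge B ↔ C.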

We will see in Section 9.5 that distributions over the observed variables from a 
DAG with latent variables satisfy constraints other than conditional independences. 
ADMGs offer the possibility to take some of those constraints into account in the
following way. The idea is to define a nested Markov property [Richardson et al.,
2012, 2017, Shpitser et al., 2014], such that a distribution is nested Markovian
with respect to an ADMG if not only some conditional independences hold that
are implied by the graph structure but also other constraints; see Section 9.5.1,
for example. It turns out that even the nested Markov property does not encode
all constraints (in the discrete case it does encode all equality constraints, though
[Evans, 2015]). We therefore have [Shpitser et al., 2014]:


{P_O : P_{O,H} induced by a DAG G with latent variables}
⊆ {P_O : P_O is nested Markovian with respect to corresponding ADMG}
⊆ {P_O : P_O is Markovian with respect to corresponding ADMG}.


For ADMGs with discrete data and the ordinary Markov property, Evans and 
Richardson [2014] provide a parametrization. This parametrization can be
extended to nested Markov models and it can be used to compute (constrained)
maximum likelihood estimators [Shpitser et al., 2012]. ADMGs are called bow-free
if between each pair of nodes there is at most one kind of edge. For linear
Gaussian models, this subclass of models allows for parameter identifiability [Brito and
Pearl, 2002a]; additionally, there are algorithms that compute maximum likelihood 
estimates [Drton et al., 2009a] or perform causal learning [Nowzohour et al., 2015]. 

Chain graphs consist of directed and undirected edges and do not allow for 
partially directed cycles [Lauritzen, 1996, Section 2.1.1]. There is an extensive 
body of work on chain graphs; see, for example, Lauritzen [1996] for an overview 
and Lauritzen and Richardson [2002] for a causal interpretation. Note that for chain 
graphs, different Markov properties have been suggested [Lauritzen and Wermuth, 
1989, Frydenberg, 1990, Andersson et al., 2001]. 

Summarizing, the representation of constraints (so far, we have mainly talked 
about conditional independences) using graphs, in particular in the case of hidden 
variables, is a non-trivial task that is still an active field of research; Sadeghi and
Lauritzen [2014] relate several types of mixed graphs and discuss their Markov 
properties. Usually, the graphical objects and their corresponding separation crite- 
ria are complicated, and it is not trivial to relate the edges to the existence of causal 
effects (one may argue that nested Markov models are a step toward simplification 
though). It is surprising that despite all the difficulties in some situations (see the 
Y-structure in Example 9.5) we are still able to learn causal ancestral relationships. 


9.4.2 Fast Causal Inference 


We have seen that for structure learning a PAG might be a more sensible output than 
a CPDAG. Indeed, it is possible to modify the PC algorithm such that it outputs 
a PAG [Spirtes et al., 2000, Section 6.2]. While this simple modification of PC 
works fine for many examples, it is not correct in general. At each iteration, the PC 
algorithm considers a pair of (currently) adjacent nodes A and B, say, and searches 
for a set that d-separates them. To achieve considerable speedups, it searches only 
through subsets of the current neighbors of nodes A and B, based on Lemma 7.8 (ii) 
in Section 7.2.1. In the presence of hidden variables, however, restricting the search 
space to subsets of the set of neighbors is not sufficient anymore [Verma and Pearl, 
1991, Lemma 3]; Spirtes et al. [2000, Section 6.3] provide an example, for which 
the modified PC algorithm fails to find a d-separating set. 

The FCI algorithm [Spirtes et al., 2000] resolves this issue. It outputs a PAG rep- 
resenting several MAGs. Zhang and Spirtes [2005] and Zhang [2008a] prove that 
a slight modification of the original FCI algorithm is complete. That is, its output 
is maximally informative. If the conditional independences originate from a DAG 
with hidden variables, the output indeed represents the correct corresponding PAG. 

Several modifications of FCI lead to significant speedups. Spirtes [2001]
suggests restricting the size of the conditioning set (anytime FCI), and Colombo et al.
[2012] reduce both the number of conditional independence tests and the size of 
the conditioning sets (really fast causal inference). Both algorithms can be slightly 
less informative than FCI. They are succeeded by FCI+, which is fast and complete 
[Claassen et al., 2013]. 

As an alternative, one might consider scoring MAGs or equivalence classes of
MAGs. Such scoring functions exist only for some classes of SCMs, such as linear 
SCMs [Richardson and Spirtes, 2002]; also, we are not aware of any efficient way 
of searching over this space of MAGs [Mani et al., 2006]. Silva and Ghahramani 
[2009] discuss a Bayesian approach for learning mixed graphs. 

Figure 9.5: Any distribution that is Markovian with respect to this graph satisfies the
Verma constraint (9.3), a non-independence constraint that appears in the marginal
distribution over A, B, C, and D; the dashed variable H is unobserved [Verma and Pearl, 1991].


9.5 Constraints beyond Conditional Independence 


We have mentioned that models with hidden variables can lead to constraints that 
are different from conditional independence constraints. We will mention a few of 
them to develop an intuition for what kind of constraints we can expect, but we mainly
point to the literature for details; see also Kela et al. [2017] for recent work and 
references to much of the earlier work. 


9.5.1 Verma Constraints 


Verma and Pearl [1991] provide the example shown in Figure 9.5. Any distribution 
that is Markovian with respect to the corresponding graph allows for the following 
Verma constraint [e.g., Spirtes et al., 2000, Chapter 6.9]. For some function f we 
have 


$$\sum_b p(b \mid a)\, p(d \mid a, b, c) = f(c, d). \tag{9.3}$$


Unlike conditional independence constraints, (9.3) lets us decide whether or not 
there is a directed edge from A to D (note that in Figure 9.5 A and D cannot be 
d-separated). Although many open questions regarding those algebraic constraints 
remain, there has been progress in understanding when such constraints appear 
[Tian and Pearl, 2002]. Shpitser and Pearl [2008b] investigate the special subclass 
of dormant independences; these are constraints that appear as independence
constraints in intervention distributions. 
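
The Verma constraint can be checked numerically. The following is a minimal sketch, assuming a hypothetical binary parametrization of the graph in Figure 9.5 (A -> B -> C -> D with the hidden H pointing into B and D); the random conditional tables and all variable names are our own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_conditional(*shape):
    """Random conditional probability table; the last axis sums to one."""
    table = rng.random(shape)
    return table / table.sum(axis=-1, keepdims=True)

# Hypothetical binary model: A -> B -> C -> D, hidden H -> B and H -> D.
p_h = random_conditional(2)          # p(h)
p_a = random_conditional(2)          # p(a)
p_b = random_conditional(2, 2, 2)    # p(b | a, h), indexed [a, h, b]
p_c = random_conditional(2, 2)       # p(c | b),    indexed [b, c]
p_d = random_conditional(2, 2, 2)    # p(d | c, h), indexed [c, h, d]

# Observed joint p(a, b, c, d), obtained by summing out h.
joint = np.einsum("h,a,ahb,bc,chd->abcd", p_h, p_a, p_b, p_c, p_d)

p_ab = joint.sum(axis=(2, 3))                          # p(a, b)
p_b_given_a = p_ab / p_ab.sum(axis=1, keepdims=True)   # p(b | a), indexed [a, b]
p_abc = joint.sum(axis=3)                              # p(a, b, c)
p_d_given_abc = joint / p_abc[..., None]               # p(d | a, b, c)

# Left-hand side of (9.3): sum_b p(b|a) p(d|a,b,c), indexed [a, c, d].
lhs = np.einsum("ab,abcd->acd", p_b_given_a, p_d_given_abc)

# The constraint states that the result does not depend on a.
verma_holds = np.allclose(lhs[0], lhs[1])
```

In this parametrization the function f(c, d) equals the sum over h of p(h) p(d | c, h), which follows by expanding p(d | a, b, c) over h.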

The question remains how one can exploit those constraints for causal learning. 
In the case of binary variables, for example, Richardson et al. [2012, 2017] and 
Shpitser et al. [2012] use nested Markov models for the parametrization of such 
models and provide a method for computing (constrained) maximum likelihood
estimators; see also Section 9.4.1. However, nested Markov models do not include
all inequality constraints, which we discuss in the following section. 

Figure 9.6: Two important examples of latent structures that entail inequality constraints.
(a) Causal structure where Z is called an instrument for X and enables some causal
statements about the effect of X on Y. (b) Causal structure of a famous experiment
used by quantum physicists to falsify assumptions of classical physics; see Section 9.5.2.


9.5.2 Inequality Constraints 


Marginalizing a graphical model over some of its variables induces a large set of 
inequality constraints [see, e.g., Kang and Tian, 2006, Evans, 2012, and references 
therein]. It would go beyond the scope of this book to mention all the known ones. 
Instead, we would like to point out the diversity of fields in which they have been 
applied. To this end, we consider two example DAGs containing observed and 
unobserved variables that appear in completely different contexts. Note that this 
section discusses only inequalities that refer to the observational distributions of 
observable variables while the literature contains also inequalities that relate
observational and intervention distributions of observable variables [see, e.g., Balke,
1995, Pearl, 2009, Chapter 8], sometimes also under additional assumptions [Silva 
and Evans, 2014, Geiger et al., 2014]. While the former task aims at falsifying a 
hypothetical latent structure, the latter one admits statements about interventions 
given that the respective DAG is true. To show some inequalities concerning only 
observational probabilities, the causal structure in Figure 9.6(a) with binary vari- 
ables entails, for instance, that 


$$P(X = 0, Y = 0 \mid Z = 0) + P(X = 0, Y = 1 \mid Z = 1) \le 1. \tag{9.4}$$


Inequalities like this have been provided in the literature [Bonet, 2001, eq. (3)] to 
test whether a variable is instrumental. This DAG plays a crucial role in analyzing 
randomized clinical trials with imperfect compliance, where Z is the instruction to 
take a medical drug, X describes whether the patient takes the drug (assume this 
can be inferred from a blood test, for example), and Y whether the patient recovers 
[see, e.g., Pearl, 2009]. 
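
As an illustration, the following sketch evaluates the general form of the instrumental inequality given by Pearl, namely that the maximum over x of the sum over y of the maximum over z of P(X = x, Y = y | Z = z) is at most 1. The randomly parametrized instrument model, the three-valued hidden variable, and the counterexample table are hypothetical choices made only for this example.

```python
import numpy as np

def instrumental_statistic(p_xy_given_z):
    """max_x sum_y max_z P(X=x, Y=y | Z=z) for a table indexed [z, x, y]."""
    return float(p_xy_given_z.max(axis=0).sum(axis=1).max())

rng = np.random.default_rng(0)

def random_conditional(*shape):
    """Random conditional probability table; the last axis sums to one."""
    table = rng.random(shape)
    return table / table.sum(axis=-1, keepdims=True)

# A model compatible with Figure 9.6(a): Z -> X -> Y with hidden H -> X, H -> Y.
p_h = random_conditional(3)         # p(h), hidden variable with three values
p_x = random_conditional(2, 3, 2)   # p(x | z, h), indexed [z, h, x]
p_y = random_conditional(2, 3, 2)   # p(y | x, h), indexed [x, h, y]
valid = np.einsum("h,zhx,xhy->zxy", p_h, p_x, p_y)   # P(x, y | z)
stat_valid = instrumental_statistic(valid)

# A table that no instrument model can produce: X is constant and Y copies Z.
invalid = np.zeros((2, 2, 2))
invalid[0, 0, 0] = 1.0   # given Z=0: X=0, Y=0
invalid[1, 0, 1] = 1.0   # given Z=1: X=0, Y=1
stat_invalid = instrumental_statistic(invalid)
```

A statistic above 1, as for the second table, falsifies the hypothesis that Z is an instrument; a value of at most 1 does not prove it.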

The causal structure shown in Figure 9.6(b) is known to entail, for instance, the 
Clauser-Horne-Shimony-Holt (CHSH) inequality [Clauser et al., 1969]:


$$E[XY \mid S = -1, T = -1] + E[XY \mid S = -1, T = 1]$$
$$+\, E[XY \mid S = 1, T = -1] - E[XY \mid S = 1, T = 1] \le 2 \tag{9.5}$$


if X, Y, S, T take values in $\{-1, 1\}$. Equation (9.5) is a generalization of Bell’s
inequality [Bell, 1964]. The latent common cause may attain arbitrarily many
values; the mere existence of a variable that d-separates {X,S} from {Y,T}
implies (9.5). Remarkably, the CHSH inequality is violated in quantum physics in a
scenario where one would intuitively agree that the underlying causal structure is 
the one in Figure 9.6(b). Two physicists A and B at different locations receive parti- 
cles from a common source described by H. Variables X and Y describe the results 
of dichotomous measurements performed on the particles received by A and B, re- 
spectively. S is a coin flip that determines which measurement out of two possible 
options is performed by A. Likewise, T is a coin flip determining the measurement 
performed by B. The unobserved common cause of X and Y is the common source 
of the particles received by A and B. According to a widely accepted
interpretation, the violation of (9.5) observed in experiments [Aspect et al., 1981] shows that
there is no classical random variable H describing the joint state of the incoming
particles such that {S,X} and {T,Y} are conditionally independent, given H. This
is because the state of quantum physical systems cannot be described by values of 
random variables. Instead, they are density operators on a Hilbert space. 
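
The classical bound and its quantum violation can be reproduced in a few lines. The sketch below uses the sign convention in which the term with S = T = 1 enters with a minus sign. It enumerates all deterministic assignments in which X depends only on S and Y only on T (mixtures of these cannot exceed their maximum), and then evaluates the same expression for a maximally entangled two-qubit state. The measurement angles and all names are assumptions chosen for illustration.

```python
import itertools
import numpy as np

settings = (-1, 1)

def chsh(corr):
    """CHSH expression; corr[s, t] stands for E[XY | S=s, T=t]."""
    return corr[-1, -1] + corr[-1, 1] + corr[1, -1] - corr[1, 1]

# Classical case: X is a function of S only, Y is a function of T only.
classical_max = -np.inf
for x_minus, x_plus, y_minus, y_plus in itertools.product((-1, 1), repeat=4):
    x = {-1: x_minus, 1: x_plus}
    y = {-1: y_minus, 1: y_plus}
    corr = {(s, t): x[s] * y[t] for s in settings for t in settings}
    classical_max = max(classical_max, chsh(corr))

# Quantum case: state (|00> + |11>) / sqrt(2), measurements in the X-Z plane.
pauli_z = np.array([[1.0, 0.0], [0.0, -1.0]])
pauli_x = np.array([[0.0, 1.0], [1.0, 0.0]])
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)

def observable(angle):
    """Dichotomous observable with eigenvalues -1 and 1."""
    return np.cos(angle) * pauli_z + np.sin(angle) * pauli_x

alice = {-1: 0.0, 1: np.pi / 2}        # angle chosen by S (hypothetical)
bob = {-1: np.pi / 4, 1: -np.pi / 4}   # angle chosen by T (hypothetical)
quantum_corr = {
    (s, t): bell @ np.kron(observable(alice[s]), observable(bob[t])) @ bell
    for s in settings for t in settings
}
quantum_value = chsh(quantum_corr)
```

Under these assumptions the classical maximum is 2, whereas the quantum value is 2 times the square root of 2.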

Information-theoretic inequalities for latent structures have gained interest since 
they are sometimes easier to handle than inequalities that refer directly to proba- 
bilities [see, e.g., Steudel and Ay, 2015]. Chaves et al. [2014] describe a family 
of inequalities for the case of discrete variables that is not complete but can be 
generated by the following systematic approach. 

First, one starts with a distribution entailed by an SCM over d discrete variables
$\mathbf{X} := (X_1, \ldots, X_d)$. For a given joint distribution $P_{X_1,\ldots,X_d}$, we can define a function


$$H : 2^{\mathbf{X}} \to \mathbb{R}_0^+$$


such that $H(X_{j_1}, \ldots, X_{j_k})$ is the Shannon entropy³ of $(X_{j_1}, \ldots, X_{j_k})$.

³ We write $H(X_{j_1}, \ldots, X_{j_k})$ instead of $H(\{X_{j_1}, \ldots, X_{j_k}\})$ for notational convenience
and again perform set operations on vectors.

Well-known properties of H are the elementary inequalities


$$H(S \cup \{X_j\}) \ge H(S) \tag{9.6}$$
$$H(S \cup \{X_j, X_k\}) \le H(S \cup \{X_j\}) + H(S \cup \{X_k\}) - H(S) \tag{9.7}$$
$$H(\emptyset) = 0, \tag{9.8}$$


where S denotes a subset of X. Inequalities (9.6) and (9.7) are known as
monotonicity and submodularity conditions, respectively; see also Section 6.10.
Furthermore, inequalities (9.6)-(9.8) are also known as polymatroid axioms in
combinatorial optimization.

To employ the causal structure, we now recall that $S \perp\!\!\!\perp T \mid R$ for all three disjoint
subsets S, T, and R of nodes for which S and T are d-separated by R. This can be
rephrased in terms of Shannon mutual information [Cover and Thomas, 1991] by


$$I(S : T \mid R) = 0, \tag{9.9}$$

which is equivalent to

$$H(S \cup R) + H(T \cup R) = H(S \cup T \cup R) + H(R). \tag{9.10}$$


Remarkably, (9.10) is a linear equation. Since conditional independences define 
nonlinear constraints on the space of probability vectors, it is more convenient to 
consider the constraints on the space of entropy vectors. 
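
To make the equivalence of conditional independence and the linear entropy equation (9.10) concrete, the following sketch builds a small discrete distribution in which S and T are independent given R and evaluates both sides. The cardinalities and the random parametrization are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(1)
p_r = rng.dirichlet(np.ones(3))                  # p(r), three values
p_s_given_r = rng.dirichlet(np.ones(2), size=3)  # p(s | r), indexed [r, s]
p_t_given_r = rng.dirichlet(np.ones(2), size=3)  # p(t | r), indexed [r, t]

# Joint over (s, r, t) for the structure S <- R -> T.
joint = np.einsum("r,rs,rt->srt", p_r, p_s_given_r, p_t_given_r)

h_sr = entropy(joint.sum(axis=2))        # H(S u R)
h_tr = entropy(joint.sum(axis=0))        # H(T u R)
h_str = entropy(joint)                   # H(S u T u R)
h_r = entropy(joint.sum(axis=(0, 2)))    # H(R)

# I(S : T | R) written as the difference of the two sides of (9.10).
cond_mutual_info = h_sr + h_tr - h_str - h_r
```

Up to floating-point error, the conditional mutual information vanishes, so both sides of (9.10) agree.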

These elementary inequalities together with Equation (9.9) imply further
inequalities. To derive them in an algorithmic way, Chaves et al. [2014] use a technique
from linear programming, the Fourier-Motzkin elimination [Williams, 1986]. Given
some subset $O \subset \mathbf{X}$ of observed variables, this procedure often yields inequalities
containing only entropies of variables in O although there may be no conditional 
independence constraints that contain only the observed ones. One example is 
given in Figure 9.7, for which Chaves et al. [2014, Theorem 1] obtain 


$$I(X : Z) + I(Y : Z) \le H(Z), \tag{9.11}$$


and likewise for cyclic permutations of the variable names. A joint distribution 
violating (9.11) is, for instance, the one where all observed variables are 0 or all 
variables are 1 with probability 1/2 each because then H(Z) = 1 bit and I(X : Z) =
I(Y : Z) = 1 bit. To understand this intuitively, note that in this example, we require
for each observed node, say Z, a deterministic relationship with both X and Y and 
therefore with U and V. But there is a trade-off between the extent to which Z can 
be determined by its unobserved cause U or by V. Z cannot perfectly follow the 
“instructions” of both U and V simultaneously (which, themselves, are indepen- 
dent). 
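
The violating distribution described above can be verified directly. This is a small sketch with our own helper names, computing the three information-theoretic quantities in bits.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(joint_2d):
    """I(U : V) in bits for a two-dimensional joint probability table."""
    return (entropy(joint_2d.sum(axis=1)) + entropy(joint_2d.sum(axis=0))
            - entropy(joint_2d))

# X = Y = Z, all 0 or all 1 with probability 1/2 each; indexed [x, y, z].
joint = np.zeros((2, 2, 2))
joint[0, 0, 0] = 0.5
joint[1, 1, 1] = 0.5

i_xz = mutual_information(joint.sum(axis=1))   # I(X : Z)
i_yz = mutual_information(joint.sum(axis=0))   # I(Y : Z)
h_z = entropy(joint.sum(axis=(0, 1)))          # H(Z)

violates_9_11 = i_xz + i_yz > h_z
```

Here the left-hand side of the inequality is 2 bits and the right-hand side is 1 bit.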

Figure 9.7: DAG that is not able to generate a joint distribution over X, Y, and Z, for which
all three observed variables attain simultaneously 0 or 1 with probability 1/2 each.

Figure 9.8: If the graph corresponds to a linear SCM, the entailed distribution will satisfy
the tetrad constraints (9.12)-(9.14).


9.5.3 Covariance-Based Constraints 


Another type of constraint appears in linear models with hidden variables. For ex- 
ample, in Figure 9.8 we obtain the tetrad constraints [Spirtes et al., 2000, Spear- 
man, 1904]: 


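$$\rho_{AC}\,\rho_{BD} - \rho_{AD}\,\rho_{BC} = 0 \tag{9.12}$$
$$\rho_{AB}\,\rho_{CD} - \rho_{AD}\,\rho_{BC} = 0 \tag{9.13}$$
$$\rho_{AC}\,\rho_{BD} - \rho_{AB}\,\rho_{CD} = 0, \tag{9.14}$$


where $\rho_{AC}$ is the correlation coefficient between variables A and C. The first
constraint (9.12), for example, can be verified easily from Figure 9.8:


$$\operatorname{cov}[A,C] \cdot \operatorname{cov}[B,D] = \alpha\gamma \operatorname{var}[H] \cdot \beta\delta \operatorname{var}[H]$$
$$= \alpha\delta \operatorname{var}[H] \cdot \beta\gamma \operatorname{var}[H] = \operatorname{cov}[A,D] \cdot \operatorname{cov}[B,C].$$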


It is possible to characterize the occurrence of vanishing tetrad constraints graph- 
ically using the language of treks and choke points [Spirtes et al., 2000, Theorem 
6.10]. Again, these constraints allow us to distinguish between different causal 
structures, just from observational data. Bollen [1989] and Wishart [1928] con- 
structed statistical tests to test for vanishing tetrad differences. These can be turned 
into a score that can be exploited for causal learning; this has been investigated by 
Spirtes et al. [2000, Chapter 11.2] and Silva et al. [2006], for example. 
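
The tetrad constraints can also be checked on the population covariance matrix of a linear model with a single hidden common cause. In the sketch below, the coefficients, the variance of H, and the noise variances are hypothetical numbers; only the structure (H pointing into A, B, C, and D) is taken from the text.

```python
import numpy as np

# Hypothetical linear SCM: each of A, B, C, D equals coefficient * H + noise.
loadings = np.array([0.8, -1.3, 0.5, 2.0])   # coefficients of H for A, B, C, D
var_h = 1.7                                  # variance of the hidden variable
noise_var = np.array([1.0, 0.5, 2.0, 0.3])   # variances of independent noises

# Population covariance and correlation matrices of (A, B, C, D).
cov = var_h * np.outer(loadings, loadings) + np.diag(noise_var)
std = np.sqrt(np.diag(cov))
rho = cov / np.outer(std, std)

A, B, C, D = range(4)
tetrads = np.array([
    rho[A, C] * rho[B, D] - rho[A, D] * rho[B, C],   # (9.12)
    rho[A, B] * rho[C, D] - rho[A, D] * rho[B, C],   # (9.13)
    rho[A, C] * rho[B, D] - rho[A, B] * rho[C, D],   # (9.14)
])
```

All three tetrad differences vanish up to floating-point error, whatever values are chosen for the coefficients and variances.
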

Kela et al. [2017] consider latent structures where all dependences between
observed variables are due to a collection of independent common causes and
describe constraints on the possible covariance matrix of the observed variables. They
emphasize that resorting to covariance matrices instead of the full distribution is 
advantageous both regarding statistical feasibility and computational tractability. 
Using functions of the observed variables (i.e., by mapping them into a feature 
space like in methods based on reproducing kernel Hilbert spaces), the method is 
also able to account for higher-order dependences. 


9.5.4 Additive Noise Models 


We have mentioned in Section 7.2.3 that learning the structure of LINGAMs can 
be based on ICA. Hoyer et al. [2008b] show that both identifiability statements and 
methods can be extended to linear non-Gaussian structures with hidden variables 
by exploiting what is known as overcomplete ICA.

For nonlinear ANMs (Section 4.1.4), we have seen that in the generic case, we
cannot have both $Y = f(X) + N_Y$ with $N_Y \perp\!\!\!\perp X$ and $X = g(Y) + M_X$ with $M_X \perp\!\!\!\perp Y$.
We expect that a similar identifiability result holds for hidden variables. The following
ANM describes the influence of a hidden variable H on the observables X and Y:


$$H := N_H \tag{9.15}$$
$$X := f(H) + N_X \tag{9.16}$$
$$Y := g(H) + N_Y. \tag{9.17}$$


For the regime of sufficiently low noise, Janzing et al. [2009a] prove that the joint
distribution $P_{H,X,Y}$ can be reconstructed from $P_{X,Y}$ up to reparametrizations of H.
It is plausible that the restriction to low noise is not necessary but just a weakness
of the proof. Setting $f(H) = H$ and $N_X = 0$ yields an ANM from X to Y (and
likewise, we can obtain an ANM from Y to X); this suggests that the additive noise
assumption renders the three cases $X \to Y$, $X \leftarrow Y$, and $X \leftarrow * \to Y$ distinguishable
from $P_{X,Y}$ alone. A relation to dimensionality reduction helps us to understand how
we can fit the model (9.15)-(9.17) from data: data points (x,y) from the
distribution $P_{X,Y}$ can be drawn using the following procedure (see Figure 9.9):

1. Draw h according to $P_H$.

2. Consider the corresponding point $(f(h), g(h))$ on the manifold

$$M := \{ (f(h), g(h)) \in \mathbb{R}^2 : h \in \mathbb{R} \}. \tag{9.18}$$

3. Add some independent noise $(n_X, n_Y)$ in each dimension.
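
The three sampling steps above can be sketched as follows. The Gaussian distributions, the noise scale, and the particular functions f and g are assumptions made for illustration; the model itself does not prescribe them.

```python
import numpy as np

def sample_confounded_anm(n, f, g, noise_scale=0.1, seed=0):
    """Draw n points following the three sampling steps for (9.15)-(9.17)."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=n)                          # step 1: draw h from P_H
    manifold = np.column_stack([f(h), g(h)])        # step 2: point on M
    noise = noise_scale * rng.normal(size=(n, 2))   # step 3: independent noise
    return h, manifold, manifold + noise

# Hypothetical choice of the functions f and g.
h, manifold, xy = sample_confounded_anm(500, f=np.tanh, g=lambda v: v ** 2)
```

A scatter plot of `xy` with the curve `manifold` overlaid gives a picture of the kind described for Figure 9.9.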

Figure 9.9: The figure shows a scatter plot for $P_{X,Y}$. The red line describes the manifold
M; see Equation (9.18).


To fit model (9.15)-(9.17) to a data sample from $P_{X,Y}$, we may therefore apply a
dimensionality reduction technique to the sample to obtain the estimate $\hat{M}$. For
recovering the corresponding value of h from a given point (x,y), this point (x,y)
should not be projected onto the manifold $\hat{M}$ because this usually leads to residuals
that will be dependent on H. Instead of small residuals $(n_X, n_Y)$, we require the
residuals to be as independent as possible from H [Janzing et al., 2009a].

There are many remaining open questions regarding the identifiability of ANMs 
with hidden variables. Such results could have an important implication, however: 
whenever we find an ANM from X to Y but not from Y to X, these identifiability 
results would show that the effect is not confounded (within the model class of 
additive noise). 


9.5.5 Detecting Low-Complexity Confounders 


Here we explain two methods by Janzing et al. [2011] that infer whether the path 
between two observed variables X and Y is intermediated by some variable that 
attains only a few values; see Figure 9.10. The scenario is the following: X is 
causally linked to Y via a DAG that has an arrowhead at Y. The question is whether 
the path between X and Y is intermediated by a variable U that has only a few 
values. Here, the direction of the arrow that connects X and U does not matter, 
but the typical application of the method would be to detect confounding if the 
confounding path is intermediated by a variable U of this simple type.

Figure 9.10: Detecting low-complexity intermediate variables: if the path between X and
Y is blocked by some variable U that attains only a few values, $P_{Y|X}$ often shows typical
properties as a “fingerprint” of U.

Janzing et al. [2011] consider, for instance, two binary variables X and U describing genetic
variants (single-nucleotide polymorphisms) of an animal or plant and a variable Y 
corresponding to some phenotype. Whenever the statistical dependence between 
X and Y is only due to the fact that U has an influence on Y and U is statistically 
related to X, then U would play the role of such an intermediate variable. Here, 
neither U nor X is a cause of the other, but there are variables like “ethnic group” 
that influence both. Therefore, U is not the common cause itself, but it lies on the 
confounding path. 

The idea of detecting this type of confounding is that U changes the conditional
$P_{Y|X}$ in a characteristic way. To discuss this, we first define a class of conditionals
that, as we will show later, usually occurs only if the path between X
and Y is not intermediated by such a U.


Definition 9.6 (Pairwise pure conditionals) The conditional distribution $P_{Y|X}$ is
said to be pairwise pure if for any two $x_1, x_2 \in \mathcal{X}$ the following condition holds.
There is no $\lambda < 0$ or $\lambda > 1$ for which


$$\lambda P_{Y|X=x_1} + (1 - \lambda) P_{Y|X=x_2} \tag{9.19}$$

is a probability distribution.


To understand Definition 9.6, note that (9.19) is always a probability distribution
for $\lambda \in [0,1]$ because it is then a convex combination of two distributions. On the
other hand, for $\lambda \notin [0,1]$, (9.19) may no longer be a non-negative measure: consider
the case where Y attains finitely many values $\mathcal{Y} := \{y_1, \ldots, y_k\}$. Then the space
of distributions of Y is the simplex whose k vertices are given by the point masses
on $y_1, \ldots, y_k$. Figure 9.11 shows this for the case k = 3, where the space of
probability distributions on $\mathcal{Y}$ is a triangle. Figure 9.11(a) shows an example of a pure
conditional: extending the connecting line between $P_{Y|X=x_1}$ and $P_{Y|X=x_2}$ leaves the
triangle, while such an extension within the space of distributions is possible in
Figure 9.11(b). Figure 9.12 shows, however, that purity is stronger than the
condition that the points $P_{Y|X=x}$ do not lie in the interior of the simplex. Here, they are
on the edges of the triangle and yet allow for an extension within the triangle.
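
For finitely many values of Y, the condition in Definition 9.6 can be tested for a given pair of conditionals by computing the largest interval of values of the mixing weight for which (9.19) stays non-negative. The following is a sketch with hypothetical function names and example vectors; it treats one pair of distributions, not a full conditional.

```python
import numpy as np

def extension_range(p1, p2):
    """Interval of weights lam for which lam * p1 + (1 - lam) * p2 is non-negative."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    d = p1 - p2                      # the mixture equals p2 + lam * d
    neg, pos = d < 0, d > 0
    lam_max = np.min(p2[neg] / -d[neg]) if neg.any() else np.inf
    lam_min = np.max(-p2[pos] / d[pos]) if pos.any() else -np.inf
    return lam_min, lam_max

def is_pure_pair(p1, p2, tol=1e-12):
    """True if the line through p1 and p2 cannot be extended inside the simplex."""
    lam_min, lam_max = extension_range(p1, p2)
    return lam_min > -tol and lam_max < 1 + tol

# Three hypothetical pairs of distributions on three values of Y.
pure = is_pure_pair([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])       # leaves the triangle
interior = is_pure_pair([0.4, 0.3, 0.3], [0.2, 0.5, 0.3])   # both points inside
same_edge = is_pure_pair([0.5, 0.5, 0.0], [0.2, 0.8, 0.0])  # both on one edge
```

The third pair mirrors the situation of two points on the boundary that nevertheless allow an extension, so only the first pair is pure.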

Figure 9.11: Visualization of a pure and a non-pure conditional. (a) Example of a pure
conditional: extending the line connecting the two points $P_{Y|X=x_1}$ and $P_{Y|X=x_2}$ would
leave the simplex of probability distributions. (b) Example of a non-pure conditional: the
line connecting $P_{Y|X=x_1}$ and $P_{Y|X=x_2}$ can be slightly extended without leaving the simplex.


If $P_{Y|X}$ has a density $(x,y) \mapsto p(y|x)$, purity can be defined by the following
intuitive condition:

$$\inf_{y \in \mathcal{Y}} \frac{p(y|x_1)}{p(y|x_2)} = 0 \quad \forall x_1, x_2 \in \mathcal{X}.$$

Exploring to what extent causal conditionals corresponding to $X \to Y$ in nature
are pure has to be left to future research. To give an example of an interesting class
of pure conditionals, we want to mention that $P_{Y|X}$ is pairwise pure if it admits an
ANM with bijective function $f_Y$ [Janzing et al., 2011, Lemma 4] and the density of
the noise satisfies a certain decay condition.

The following result shows that a pure conditional strongly suggests that the
causal path between X and Y is not intermediated by a variable that attains only
a few values.


Theorem 9.7 (Strictly positive conditionals and non-purity) Assume there is a
variable U such that $X \perp\!\!\!\perp Y \mid U$. Further, assume that the range $\mathcal{U}$ of U is finite
and that the conditional density p(u|x) is strictly positive for all $u \in \mathcal{U}$ and for all
x such that $P_{U|X=x}$ is defined. Then, $P_{Y|X}$ is not pairwise pure.


Proof. It is easy to see that the conditional $P_{U|X}$ is not pairwise pure because
$\inf_{u \in \mathcal{U}} p(u|x_1)/p(u|x_2) \neq 0$ for all $x_1, x_2$ for which $P_{U|X=x_i}$ is defined. Due to
$p(y|x) = \sum_u p(y|u)\, p(u|x)$, the conditional $P_{Y|X}$ is a concatenation of $P_{Y|U}$ and $P_{U|X}$
and therefore also not pure because $P_{U|X}$ is not pure [see Janzing et al., 2011,
Lemma 8].


Although the theorem holds for all finite variables, the second assumption of
strict positivity of the conditional $P_{U|X}$ is much more plausible if U attains only a
few values. Otherwise, it may happen that there exist values u for which p(u|x) is
so close to 0 that this may result in $P_{Y|X}$ being almost pure.

Figure 9.12: Another example of a non-pure conditional: the line connecting $P_{Y|X=x_1}$ and
$P_{Y|X=x_2}$ can be extended without leaving the simplex.

To see an instructive example showing how the intermediate node typically spoils
purity, assume that U and X are binary with $p(u|x) = 1 - \epsilon$ for u = x. We then have


$$P_{Y|X=0} = P(U = 0 \mid X = 0)\, P_{Y|U=0} + P(U = 1 \mid X = 0)\, P_{Y|U=1}$$
$$= (1 - \epsilon)\, P_{Y|U=0} + \epsilon\, P_{Y|U=1}.$$


Hence, $P_{Y|X=0}$ lies in the interior of the line segment connecting $P_{Y|U=0}$ and $P_{Y|U=1}$
(and likewise for $P_{Y|X=1}$). Thus, $P_{Y|X}$ is not pure.

Another example of how intermediate variables can leave characteristic
“fingerprints” in the distribution $P_{X,Y}$ will be formulated using the following property
of a conditional [Allman et al., 2009, Janzing et al., 2011]:


Definition 9.8 (Rank of a conditional) The rank of $P_{Y|X}$ is the dimension of the
vector space spanned by all vectors $P_{Y|X \in A}$ in the space of measures, where A runs
over all measurable subsets of the range of X with non-zero probability.


Janzing et al. [2011] do not provide an algorithm for estimating the rank,
however. If Y has finite range, $P_{Y|X}$ defines a stochastic matrix whose rank coincides
with the rank of $P_{Y|X}$. The following result is a simple observation [Allman et al.,
2009]:


Theorem 9.9 (Rank and the range of U) If $X \perp\!\!\!\perp Y \mid U$ and U attains k values,
then the rank of $P_{Y|X}$ is at most k.
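
For finite ranges, the rank bound can be illustrated by composing two stochastic matrices. The sizes and the random conditionals in the sketch below are hypothetical, and the numerical tolerance passed to the rank computation is our own choice.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n_x, n_y = 2, 5, 4   # U attains k values; X has 5 values and Y has 4 values

p_u_given_x = rng.dirichlet(np.ones(k), size=n_x)     # rows: x, columns: u
p_y_given_u = rng.dirichlet(np.ones(n_y), size=k)     # rows: u, columns: y

# If X and Y are independent given U, then p(y|x) = sum_u p(y|u) p(u|x).
p_y_given_x = p_u_given_x @ p_y_given_u               # rows: x, columns: y

rank = np.linalg.matrix_rank(p_y_given_x, tol=1e-10)
```

Although the resulting matrix has five rows and four columns, its rank does not exceed the number of values of U.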


It is easy to show that under the conditions of Theorem 9.9, $P_{X,Y}$ can be
decomposed into a mixture of k product distributions. This observation generalizes to the
multivariate case: whenever there is a variable U attaining k values such that
conditioning on U renders $X_1, \ldots, X_d$ jointly independent, then $P_{X_1,\ldots,X_d}$ decomposes
into a mixture of k product distributions. Sgouritsa et al. [2013] and Levine et al.
[2011] describe methods to find this decomposition with the goal of detecting the
“confounder” U via identifying the product distributions.


9.5.6 Different Environments 


The invariant causal prediction approach we describe in Sections 7.1.6 and 7.2.5 
can be modified to deal with hidden variables [Peters et al., 2016, Section 5.2], 
as long as the hidden variables are not affected by interventions. Furthermore, 
Rothenhäusler et al. [2015, “backShift”] consider the special case of linear SCMs.
Assume that we observe a vector $\mathbf{X}^e$ of d random variables in different
environments $e \in \mathcal{E}$. Here, the environments are generated by (unknown) shift variables
$\mathbf{C}^e = (C^e_1, \ldots, C^e_d)$ that are required to be independent of each other and of the noise
variables. That is, for each environment e we have


$$\mathbf{X}^e = B\mathbf{X}^e + \mathbf{C}^e + \mathbf{N}^e,$$


where the distribution of $\mathbf{N}^e$ does not depend on e. We can allow for hidden
variables by assuming non-zero covariance between the different components of the
noise variables. It still follows that


$$(I - B)\, \Sigma_{X,e}\, (I - B)^T = \Sigma_{C,e} + \Sigma_N$$


with $\Sigma_{X,e}$, $\Sigma_{C,e}$, and $\Sigma_N$ being the covariance matrices of $\mathbf{X}^e$, $\mathbf{C}^e$, and $\mathbf{N}^e$,
respectively. Ergo,

$$(I - B)\, (\Sigma_{X,e} - \Sigma_{X,f})\, (I - B)^T = \Sigma_{C,e} - \Sigma_{C,f}. \tag{9.20}$$


(Note that for each environment e, one may pool all other environments to obtain
the “environment” f.) By assumption, for all choices of e and f, the right-hand side
of Equation (9.20) is diagonal, which allows us to reconstruct the causal structure B
by joint diagonalization of $\Sigma_{X,e} - \Sigma_{X,f}$. If there are at least three environments, this
procedure allows us to identify B under weak assumptions [Rothenhäusler et al.,
2015, Theorem 1].
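
The identity (9.20) can be verified on population covariance matrices. The sketch below does not implement the joint diagonalization estimator itself; it only checks that, for a known B, the transformed covariance difference is diagonal even though the noise covariance is not. The matrix B, the shift variances, and the noise covariance are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
B = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],
              [0.0, -0.5, 0.0]])   # hypothetical structure X1 -> X2 -> X3
identity = np.eye(d)

# Noise covariance with non-zero off-diagonal entries (mimics hidden variables).
factor = rng.normal(size=(d, d))
sigma_n = factor @ factor.T

def sigma_x(shift_variances):
    """Covariance of X in an environment with independent shifts of given variances."""
    sigma_c = np.diag(shift_variances)
    inverse = np.linalg.inv(identity - B)
    return inverse @ (sigma_c + sigma_n) @ inverse.T

sigma_x_e = sigma_x([1.0, 0.2, 3.0])   # environment e
sigma_x_f = sigma_x([0.5, 2.0, 0.1])   # environment f

# Left-hand side of (9.20); the noise covariance cancels in the difference.
diff = (identity - B) @ (sigma_x_e - sigma_x_f) @ (identity - B).T
off_diagonal = diff - np.diag(np.diag(diff))
```

The diagonal of `diff` recovers the differences of the shift variances between the two environments.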

The latter example shows how imposing regularity conditions (such as linear models
and independent shift interventions) among different environments allows us to
reconstruct the underlying causal structure even in the presence of hidden variables.


9.6 Problems 


Problem 9.10 (Sufficiency) Prove Remark 9.4. 


Problem 9.11 (Simpson’s paradox) Construct an SCM $\mathfrak{C}$ with binary random
variables X, Y and a sequence $Z_1, Z_2, \ldots$ of variables, such that for all even d > 0
and all $z_1, \ldots, z_{d+1}$,


$$P(Y = 1 \mid X = 1, Z_1 = z_1, \ldots, Z_d = z_d)$$
$$> P(Y = 1 \mid X = 0, Z_1 = z_1, \ldots, Z_d = z_d)$$


but


$$P(Y = 1 \mid X = 1, Z_1 = z_1, \ldots, Z_d = z_d, Z_{d+1} = z_{d+1})$$
$$< P(Y = 1 \mid X = 0, Z_1 = z_1, \ldots, Z_d = z_d, Z_{d+1} = z_{d+1}).$$


This example drives Simpson’s paradox to an extreme. If X indicates treatment,
Y recovery, and Z1, Z2,... some confounding factors, then, by the adjustment for- 
mula (6.13), adjusting for more and more variables always turns around the causal 
conclusion whether the treatment is helpful or harmful. 


Problem 9.12 (Instrumental variables) Consider the SCM 


$$H := N_H$$
$$Z := N_Z$$
$$X := f(Z) + g(H) + N_X$$
$$Y := \alpha X + j(H) + N_Y$$


and assume that we observe the joint distribution over Z, X, and Y. Given the
distribution rather than a finite sample, regressing X on Z non-parametrically yields
the conditional mean $E[X \mid Z = z]$ as regression function. Write down the two-stage
least squares method and prove that it identifies $\alpha$.


10 


Time Series 


Reasoning about causal relations among variables that refer to different time in- 
stances is easier than causal reasoning without time structure. Causal structures 
have to be consistent with the time order. We have seen in Section 7.2.4 that, after 
knowing a causal ordering of nodes and assuming that there are no hidden vari- 
ables, finding the causal DAG does not require assumptions other than the Markov 
condition and minimality (more debatable conditions such as faithfulness or re- 
stricted function classes, for instance, are not necessary). Given the time order, 
three main issues remain. First, the set of variables under consideration may not 
be causally sufficient; second, there may be variables that refer to the same time 
instant (within the given measurement accuracy) that cannot be causally ordered a 
priori; third, in practice, we are often given only one repetition of the time series — 
this differs from the usual i.i.d. setting, in which we observe every variable several 
times. Accordingly, all these issues play a crucial role for causal reasoning in time 
series. 


10.1 Preliminaries and Terminology 


So far, we have considered a setting where samples are drawn i.i.d. from the joint
distribution $P_{X_1,\ldots,X_d}$. Here, we discuss causal inference in time series, that is,
we have a d-variate time series $(\mathbf{X}_t)_{t \in \mathbb{Z}}$, where each $\mathbf{X}_t$ for fixed t is the vector
$(X_t^1, \ldots, X_t^d)$. We assume that it describes a strictly stationary stochastic process
[e.g., Brockwell and Davis, 1991]. Each variable $X_t^j$ represents a measurement of
the jth observable of some system at time t. Since causal influence can never go
from the future to the past, we distinguish between two types of causal relations in
multivariate time series.

Figure 10.1: Example of a time series with no instantaneous effects.

Figure 10.2: Example of a time series with instantaneous effects.

First, the causal graph¹ with nodes $X_t^j$ for $(j,t) \in \{1, \ldots, d\} \times \mathbb{Z}$ contains only
arrows from $X_t^j$ to $X_s^k$ for t < s but not for t = s; see Figure 10.1. Then we say there
are no instantaneous effects. Second, the causal graph contains instantaneous
effects, that is, arrows from $X_t^j$ to $X_t^k$ for some j and k in addition to arrows going
from $X_t^m$ to $X_s^\ell$ for t < s and some m and $\ell$, as shown in Figure 10.2. We call the
causal structure purely instantaneous if for any $j \neq k$ and h > 0 the variable $X_t^j$
may influence $X_t^k$ and $X_{t+h}^j$ but not $X_{t+h}^k$; see Figures 10.5(a) and 10.5(b). The case
where each $X_t^j$ is not influenced by any previous variable (including its own past)
can be ignored because it need not be described as time series. Instead, the index t
may then be considered as labeling indices of independent instances of a statistical
sample in the i.i.d. setting of previous chapters.

We define the full time graph as the DAG having $X_t^j$ as nodes, as visualized in
Figures 10.1 and 10.2. In contrast to previous chapters, the full time graph is a
DAG with infinitely many nodes. The summary graph is the directed graph with
nodes $X^1, \ldots, X^d$ containing an arrow from $X^j$ to $X^k$ for $j \neq k$ whenever there is an
arrow from $X_t^j$ to $X_s^k$ for some $t \le s \in \mathbb{Z}$. Note that the summary graph is a directed
graph that may contain cycles, although we will assume that the full time graph
is acyclic. Figure 10.3 shows the summary graph corresponding to the full time
graphs depicted in Figures 10.1 and 10.2. For any $t \in \mathbb{Z}$, we denote by $\mathbf{X}_{\text{past}(t)}$ the
set of all $\mathbf{X}_s$ with s < t and use $X^j_{\text{past}(t)}$ for the past of a specific component $X^j$.
We also write $\mathbf{X}_{\text{past}}$ if t is some fixed time instant of reference. Moreover, $(\mathbf{X}_t^{-j})_{t \in \mathbb{Z}}$
denotes the collection of time series $(X_t^k)_{t \in \mathbb{Z}}$ for all $k \neq j$.

¹ Strictly speaking, we have introduced the causal DAG only for finitely many nodes so far. Here,
however, we need infinite graphs and neglect this technical subtlety [see, e.g., Peters et al., 2013].

Figure 10.3: Summary graph of the full time graphs shown in Figures 10.1 and 10.2.
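
The passage from a full time graph to its summary graph is a simple projection that forgets the time indices. The following sketch assumes that, by stationarity, it suffices to list the edges pointing into one time slice; the edge list is a hypothetical example and not a transcription of Figures 10.1 or 10.2.

```python
def summary_graph(full_time_edges):
    """Collapse edges ((j, t), (k, s)) of a full time graph into pairs (j, k), j != k."""
    return {(j, k) for (j, _t), (k, _s) in full_time_edges if j != k}

# Hypothetical excerpt of a full time graph with components 1, 2, and 3.
# Each node is written as (component, time).
edges = [
    ((1, 0), (1, 1)),   # lagged effect within component 1 (ignored in the summary)
    ((1, 0), (2, 1)),   # lagged effect from component 1 to component 2
    ((2, 0), (2, 1)),
    ((2, 1), (3, 1)),   # instantaneous effect from component 2 to component 3
    ((3, 0), (3, 1)),
]
summary = summary_graph(edges)
```

For this edge list the summary graph contains an arrow from component 1 to component 2 and one from component 2 to component 3.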


10.2 Structural Causal Models and Interventions 


We assume that the stochastic process $(\mathbf{X}_t)_{t \in \mathbb{Z}}$ admits a description by an SCM in
which at most the past q values (for some q) of all variables occur:


$$X_t^j = f^j\big( (\mathbf{PA}_q^j)_{t-q}, \ldots, (\mathbf{PA}_1^j)_{t-1}, (\mathbf{PA}_0^j)_t, N_t^j \big), \tag{10.1}$$


where

$$\ldots, N_{t-1}^1, \ldots, N_{t-1}^d, N_t^1, \ldots, N_t^d, N_{t+1}^1, \ldots, N_{t+1}^d, \ldots$$


are jointly independent noise terms. Here, for each $s \in \mathbb{Z}$, the symbol $(\mathbf{PA}_s^j)_{t-s}$
denotes the set of variables $X_{t-s}^k$, k = 1,...,d, that influence $X_t^j$. Note that $\mathbf{PA}_s^j$
may contain $X^j$ for all s > 0, but not for s = 0. We assume the corresponding full
time graph to be acyclic.

A popular special case of (10.1) is the class of vector autoregressive models 
(VAR) [Liitkepohl, 2007]: 


; Gs ; 
X/ =) A/X,-i + N7, (10.2) 
i=l 


where each Al is a 1 x d-matrix; see also Remark 6.5 on linear cyclic models, 
especially Equation (6.4). 
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Figure 10.4: Example of a subsampled time series: only the variables in the shaded areas 
are observed. 


As in the i.i.d. setting, SCMs formalize the effect of interventions; more pre- 
cisely, an intervention corresponds to replacing some of the structural assignments. 
Interventions may, for instance, consist in setting all values {X/ Wez for some j to 
certain values. Alternatively, one could also intervene on Xj’ only at one specific 
time instant t. 
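
The VAR model (10.2) and the first kind of intervention mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the order is fixed to q = 1, the noise is standard Gaussian, and the function name `simulate_var1`, the argument `do`, and the coefficient matrix are hypothetical choices that do not appear in the text. The intervention replaces the structural assignment of one component by a constant for all time points.

```python
import random

def simulate_var1(A, n_steps, do=None, seed=0):
    """Simulate X_t := A X_{t-1} + N_t with independent standard Gaussian noise.

    Row j of A plays the role of the 1 x d matrix A^j_1 in (10.2) with q = 1.
    `do` optionally maps a component index j to a constant; the structural
    assignment of X^j_t is then replaced by that constant for all t.
    """
    do = do or {}
    rng = random.Random(seed)
    d = len(A)
    x = [do.get(j, 0.0) for j in range(d)]      # initial state
    series = []
    for _ in range(n_steps):
        noise = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # The comprehension reads the previous state x before x is reassigned.
        x = [do[j] if j in do
             else sum(A[j][k] * x[k] for k in range(d)) + noise[j]
             for j in range(d)]
        series.append(x)
    return series

# Illustrative coefficients: X^1 drives X^2 with lag one, but not vice versa.
A = [[0.5, 0.0],
     [0.8, 0.3]]
observational = simulate_var1(A, n_steps=2000)
interventional = simulate_var1(A, n_steps=2000, do={0: 2.0})  # set X^1_t := 2 for all t
```

In this sketch the intervention on the first component shifts the mean of the second component, whereas an intervention on the second component would leave the first one unchanged, in line with the summary graph $X^1 \to X^2$ of the assumed coefficients.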


10.2.1 Subsampling 


In many applications, the sampling process may be slower than the time scale of 
the causal processes. Figure 10.4 shows an example, in which only every second 
time instance is observed. The summary graph of the original full system contains 
the edges $X^1 \to X^2 \to X^3$. We may now want to construct a causal model for the
observed, subsampled processes. It is therefore important to define which
interventions we want to allow for. First, if we constrain ourselves to interventions on
observed time points, there should be no causal influence from $X^1$ to $X^2$. Intervening
on an observed instance of $X^1$ does not have any effect on the observable part
of $X^2$ (note that the time series $X^1$ has only lag two effects $X^1_t \to X^1_{t+2}$).
Furthermore, in this setting, subsampling cannot create spurious instantaneous effects if
these have not been there before. For the case of an SCM, Bongers et al. [2016, 
Chapter 3] describe a formal process of how to marginalize the model by substi- 
tuting the causal mechanisms of the hidden time steps into the other mechanisms. 
The resulting model describes the effect of interventions correctly if these are re- 
stricted to the observed time points. Second, if we do consider interventions on 
hidden variables, however, we may be interested in recovering the original summary
graph, a problem that is addressed by Danks and Plis [2013] and Hyttinen
et al. [2016], for example.

There are situations in which subsampling is not a good model for the data- 
generating process. For many physical measurements, for example, one may want 
to model the observations as averages of consecutive time points rather than as a 
sparse subset of those. The former is a useful but also complicated model assump- 
tion: the averaging process might change the model class, and one furthermore 
needs to be careful about modeling interventions. 


10.3 Learning Causal Time Series Models 


Currently, Granger causality and its variations are among the most popular approaches
to causal time series analysis. To provide a better link among the chapters,
we nevertheless first explain the conclusions that can be drawn using a conditional 
independence-based approach. The order should by no means be mistaken as a 
judgment about the approaches. 

Sections 10.3.1 and 10.3.2 contain mostly identifiability results. The remaining 
three Sections, 10.3.3, 10.3.4, and 10.3.5, contain more concrete causal learning 
methods for time series. They can be applied if the multivariate time series has been 
sampled once, at finitely many time points. Most of the ideas, however, transfer to
situations where we receive several i.i.d. repetitions of the same time series.


10.3.1 Markov Condition and Faithfulness 


Lemma 6.25 states that two DAGs are Markov equivalent if and only if their skele- 
ton and their set of v-structures coincide. If there are no instantaneous effects, the 
full time graph is therefore already determined by knowing its skeleton. The arrows
can only be directed forward in time. We thus conclude [Peters et al., 2013, Proof
of Theorem 1]:


Theorem 10.1 (Identifiability in absence of instantaneous effects) Assume that
two full time graphs are induced by SCMs without instantaneous effects. If the full
time graphs are Markov equivalent, then they are equal.


Hence, we can uniquely identify the full time graph from conditional independences
provided that the Markov condition and faithfulness hold (to deal with
infinitely large DAGs, one sometimes assumes that the time series start at $t = 0$).

Figure 10.5: Two DAGs that are not Markov equivalent although they coincide up to
instantaneous effects. (a) There are v-structures at all nodes of $(Y_t)_{t \in \mathbb{Z}}$.
(b) There are v-structures at all nodes of $(X_t)_{t \in \mathbb{Z}}$.


In the presence of instantaneous effects, Markov equivalent graphs can at most
differ by the direction of those effects. However, there are many cases where even
that direction can be identified because different directions of instantaneous effects
induce different sets of v-structures. A simple example is shown in Figure 10.5.
The direction of the instantaneous effect can still be inferred even if arrows from
$X_t$ to $Y_{t+1}$ for all $t \in \mathbb{Z}$ are added to Figure 10.5, and likewise if arrows from $Y_t$ to
$X_{t+1}$ are added; we cannot add both, however, because this would remove all
v-structures. The following sufficient condition for the identifiability of the direction
of instantaneous effects has been given by Peters et al. [2013, Theorem 1]:


Theorem 10.2 (Identifiability for acyclic summary graphs) Assume that two
full time graphs are induced by SCMs, and that in both cases for each $j$, $X^j_t$ is
influenced by $X^j_{t-s}$ for some $s \ge 1$. Assume further that the summary graphs are
acyclic. If the full time graphs are Markov equivalent, then they are equal.


The following result shows that the presence of any arrow in the summary graph 
can in principle be decided from a single conditional independence test. 


Theorem 10.3 (Justification of Granger causality) Consider an SCM without
instantaneous effects for the time series $(X_t)_{t \in \mathbb{Z}}$ such that the induced joint
distribution is faithful with respect to the corresponding full time graph. Then the
summary graph has an arrow from $X^j$ to $X^k$ if and only if there exists a $t \in \mathbb{Z}$ such
that

$$X^k_t \not\perp\!\!\!\perp X^j_{\mathrm{past}(t)} \mid X^{-j}_{\mathrm{past}(t)}. \tag{10.3}$$

For completeness, we have included the proof in Appendix C.14. Similar results 
can be found in White and Lu [2010] and Eichler [2011, 2012]. As already sug- 
gested by the headline of Theorem 10.3, this is the basis of Granger causality that 
we discuss in more detail in Section 10.3.3.


10.3.2 Some Causal Conclusions Do Not Require Faithfulness


Remarkably, interesting causal conclusions can even be made from conditional 
dependences without using faithfulness. This is in contrast to the i.i.d. case where 
any distribution is Markovian with respect to the complete DAG for any ordering 
of nodes. Since there are no arrows backward in time, the Markov condition for 
time series is sufficient to infer whether the summary graph is $X \to Y$ or $Y \to X$,
given that one of the two alternatives is true. 


Theorem 10.4 (Detection of arrow $X \to Y$) Consider an SCM for the bivariate
time series $(X_t, Y_t)_{t \in \mathbb{Z}}$.


(i) If there is a $t \in \mathbb{Z}$ such that

$$Y_t \not\perp\!\!\!\perp X_{\mathrm{past}(t)} \mid Y_{\mathrm{past}(t)}, \tag{10.4}$$

then the summary graph contains an arrow from X to Y.


(ii) Assume further that there are no instantaneous effects and the joint density
of any finite subset of variables is strictly positive. If for all $t \in \mathbb{Z}$, we have

$$Y_t \perp\!\!\!\perp X_{\mathrm{past}(t)} \mid Y_{\mathrm{past}(t)}, \tag{10.5}$$

then the summary graph contains no arrow from X to Y.


Again, this proof may have appeared elsewhere, but we include it for complete- 
ness in Appendix C.15. Proving (ii) requires causal minimality, which is strictly 
weaker than faithfulness. 

In the next subsection we will see that Theorem 10.4 and various variations [e.g., 
White and Lu, 2010, Eichler, 2011, 2012] link conditional independence-based 
approaches to causal discovery to Granger causality. 


10.3.3 Granger Causality 


For simplicity, we start with the bivariate version of Granger causality. 


Bivariate Granger Causality Theorem 10.4 shows (subject to excluding instantaneous
effects together with mild technical conditions) that the presence or absence
of an arrow in the summary graph can be inferred by testing (10.5) and the
analogous statement when exchanging the roles of X and Y. We can then distinguish
between the possible summary graphs $X \quad Y$ (no arrow), $X \to Y$, $X \leftarrow Y$, and
$X \leftrightarrows Y$. One infers that X influences Y whenever the past values of X help in
predicting Y from its own past. Formally, we write

$$X \text{ Granger-causes } Y \;:\Longleftrightarrow\; Y_t \not\perp\!\!\!\perp X_{\mathrm{past}(t)} \mid Y_{\mathrm{past}(t)}. \tag{10.6}$$


Figure 10.6: Typical scenario, in which Granger causality works: if all arrows from X
to Y were missing, $Y_t$ would be conditionally independent of the past values of X, given
its own past. Here, $Y_t$ does depend on the past values of X, given its own past. Thus,
condition (10.4) proves the existence of an influence from X to Y.


This idea already goes back to Wiener [1956, pages 189-190], who argued that X 
has a causal influence on Y if the prediction of Y from its own past is improved by 
additionally accounting for X. The typical scenario in which Theorem 10.4 holds
is depicted in Figure 10.6.

Often Granger causality refers to linear prediction. Then, one compares the
following two linear regression models:

$$Y_t = \sum_{i=1}^{q} a_i Y_{t-i} + N_t \tag{10.7}$$

$$Y_t = \sum_{i=1}^{q} a_i Y_{t-i} + \sum_{i=1}^{q} b_i X_{t-i} + \tilde{N}_t, \tag{10.8}$$

where $(N_t)_{t \in \mathbb{Z}}$ and $(\tilde{N}_t)_{t \in \mathbb{Z}}$ are assumed to be i.i.d. time series, respectively. X is
inferred to Granger-cause Y whenever the noise term $\tilde{N}_t$ (for predictions including
X) has significantly smaller variance than the noise term $N_t$ obtained without
X. This amounts to saying that $Y_t$ has non-vanishing partial correlations to $X_{\mathrm{past}(t)}$,
given $Y_{\mathrm{past}(t)}$. For multivariate Gaussian distributions, this is equivalent to the
dependence statement (10.4). Modifications of this idea that use nonlinear regression
have been extensively studied, too [e.g., Ancona et al., 2004, Marinazzo et al., 
2008]. For non-parametric testing of (10.5) see, for instance, Diks and Panchenko 
[2006] and references therein. 
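
The comparison of (10.7) and (10.8) can be sketched as follows for lag q = 1. The snippet is a minimal illustration, not a reference implementation: it assumes zero-mean series (so no intercept is fitted), uses a standard F statistic for one additional regressor as a stand-in for "significantly smaller variance", and all function names and simulation coefficients are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rss_one(target, u):
    """Residual sum of squares of regressing target on u (no intercept)."""
    a = dot(target, u) / dot(u, u)
    return dot(target, target) - a * dot(target, u)

def rss_two(target, u, v):
    """Residual sum of squares of regressing target on u and v (no intercept)."""
    uu, vv, uv = dot(u, u), dot(v, v), dot(u, v)
    tu, tv = dot(target, u), dot(target, v)
    det = uu * vv - uv * uv
    a = (tu * vv - tv * uv) / det      # coefficient of u
    b = (tv * uu - tu * uv) / det      # coefficient of v
    return dot(target, target) - a * tu - b * tv

def granger_f_statistic(x, y):
    """F statistic for 'x Granger-causes y' with lag q = 1.

    Restricted model (10.7): y_t ~ y_{t-1}.
    Full model (10.8):       y_t ~ y_{t-1} + x_{t-1}.
    """
    target, y_lag, x_lag = y[1:], y[:-1], x[:-1]
    rss_restricted = rss_one(target, y_lag)
    rss_full = rss_two(target, y_lag, x_lag)
    dof = len(target) - 2              # observations minus parameters of the full model
    return (rss_restricted - rss_full) / (rss_full / dof)

# Simulated example (assumed coefficients): x drives y with lag one, not vice versa.
rng = random.Random(0)
x, y = [0.0], [0.0]
for _ in range(2000):
    x_new = 0.5 * x[-1] + rng.gauss(0.0, 1.0)
    y_new = 0.3 * y[-1] + 0.8 * x[-1] + rng.gauss(0.0, 1.0)
    x.append(x_new)
    y.append(y_new)

f_xy = granger_f_statistic(x, y)   # large: the past of x improves the prediction of y
f_yx = granger_f_statistic(y, x)   # typically small: the past of y does not help for x
```

A large value of `f_xy` would be compared with a quantile of the F distribution with 1 and n - 2 degrees of freedom; the sketch stops short of computing that quantile.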

An information theoretic quantity measuring the dependence between $Y_t$ and the
past of X, given the past of Y, is given by transfer entropy [Schreiber, 2000]:

$$TE(X \to Y) := I(Y_t : X_{\mathrm{past}(t)} \mid Y_{\mathrm{past}(t)}), \tag{10.9}$$

where $I(A : B \mid C)$ denotes the conditional mutual information [Cover and Thomas,
1991] for any three sets A, B, C of variables; see also Appendix A. Estimat- 
ing transfer entropy and inferring that X causes Y whenever it is significantly 
greater than 0 can thus be considered as an information theoretic implementation 
of Granger causality that accounts for arbitrary nonlinear influences. It is therefore 
tempting to consider transfer entropy as a measure of the strength of the influence 
of X on Y, but “Limitations of Granger Causality” will explain why this is not 
appropriate. 


Multivariate Granger Causality The assumption of causal sufficiency of a bi- 
variate time series as in Theorem 10.4 is often inappropriate. This has already been 
addressed by Granger [1980]. We therefore say $X^j$ Granger-causes $X^k$ if

$$X^k_t \not\perp\!\!\!\perp X^j_{\mathrm{past}(t)} \mid X^{-j}_{\mathrm{past}(t)}.$$

Granger already emphasized that proper use of Granger causality would actually
require conditioning on all relevant variables in the world. Nevertheless, Granger
causality is often used in its bivariate version or in situations, in which clearly 
important variables are unobserved. Such a use can yield misleading statements 
when interpreting the results causally. 


Limitations of Granger Causality Violation of causal sufficiency is — as in 
the i.i.d. scenario of the previous chapters — a serious issue in causal time series 
analysis. To explain why Granger causality is misleading in a causally insuffi- 
cient multivariate time series, we restrict the attention to the case where only a 
bivariate time series $(X_t, Y_t)_{t \in \mathbb{Z}}$ is observed. Assume that both $X_t$ and $Y_t$ are
influenced by previous instances of a hidden time series $(Z_t)_{t \in \mathbb{Z}}$. This is depicted in
Figure 10.7(a) where Z influences X with a delay of 1, and Y with a delay of 2.
Assuming faithfulness, the d-separation criterion tells us

$$Y_t \not\perp\!\!\!\perp X_{\mathrm{past}(t)} \mid Y_{\mathrm{past}(t)},$$

while we have

$$X_t \perp\!\!\!\perp Y_{\mathrm{past}(t)} \mid X_{\mathrm{past}(t)}.$$


Figure 10.7: In these examples, Granger causality infers an incorrect graph structure.
(a) Due to the hidden common cause Z, Granger causality erroneously infers causal
influence from X to Y. (b) Granger causality erroneously infers neither causal influence
from X to Y nor from Y to X if the influence from $X_t$ on $Y_{t+1}$ and the one from $Y_t$ to
$X_{t+1}$ are deterministic.


Thus, naive application of Granger causality infers that X causes Y and Y does not 
cause X. This effect has been observed, for instance, for the relation between the 
price of butter and the price of cheese. Both prices are strongly influenced by the 
price of milk, but the production of cheese takes much longer than the production of 
butter, which causes a larger delay between the prices of milk and cheese [Peters 
et al., 2013, Experiment 10]. This failure of Granger causality, however, is only 
possible because not all relevant variables are observed, which was stated as a 
requirement by Granger himself. 
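
The failure mode of Figure 10.7(a) can be reproduced in a small simulation. The sketch below makes several assumptions that are not in the text: Z is i.i.d. standard Gaussian, the structural assignments are linear with additive Gaussian noise of standard deviation 0.5, lag q = 1 is used, and `variance_ratio` is a hypothetical helper that divides the residual variance of model (10.8) by that of model (10.7).

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def variance_ratio(cause, effect):
    """Residual variance of effect_t given (effect_{t-1}, cause_{t-1}), divided by
    the residual variance given effect_{t-1} alone (no intercept, lag q = 1)."""
    t, own, other = effect[1:], effect[:-1], cause[:-1]
    rss_own = dot(t, t) - dot(t, own) ** 2 / dot(own, own)
    oo, cc, oc = dot(own, own), dot(other, other), dot(own, other)
    to, tc = dot(t, own), dot(t, other)
    det = oo * cc - oc * oc
    a = (to * cc - tc * oc) / det      # coefficient of the own past
    b = (tc * oo - to * oc) / det      # coefficient of the other series' past
    rss_both = dot(t, t) - a * to - b * tc
    return rss_both / rss_own

rng = random.Random(1)
n = 5000
z = [rng.gauss(0.0, 1.0) for _ in range(n)]                        # hidden common cause
x = [z[t - 1] + 0.5 * rng.gauss(0.0, 1.0) for t in range(2, n)]    # Z -> X with delay 1
y = [z[t - 2] + 0.5 * rng.gauss(0.0, 1.0) for t in range(2, n)]    # Z -> Y with delay 2

ratio_x_to_y = variance_ratio(x, y)   # clearly below 1: spurious "X Granger-causes Y"
ratio_y_to_x = variance_ratio(y, x)   # close to 1: no evidence that Y Granger-causes X
```

Neither series influences the other in this simulation, yet the past of X carries information about $Z_{t-2}$ and therefore improves the prediction of $Y_t$.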

Figure 10.8: Two scenarios with instantaneous effects, one where Granger causality fails
to detect them (a) and one where it does not (b). (a) Granger causality cannot detect the
influence of X on Y because the past of X influences $Y_t$ only via the past of Y. (b) Here,
the past of X is still helpful for predicting $Y_t$ since $X_{t-1}$ influences $Y_t$ indirectly via
$X_t$. Thus, Granger causality is still able to detect the influence of X on Y.


A second example for a scenario where Granger causality fails has been provided by Ay
and Polani [2008] and is depicted in Figure 10.7(b). Assume that $X_{t-1}$ influences
$Y_t$ deterministically via a copy operation, that is, $Y_t := X_{t-1}$. Likewise, the value
of $Y_{t-1}$ is copied to $X_t$. Then it is intuitively obvious that X and Y strongly influence
each other in the sense that intervening on the value $X_t$ changes all the values
$Y_{t+1+2k}$ for $k \in \mathbb{N}_0$. Likewise, intervening on $Y_t$ changes all values $X_{t+1+2k}$.
Nevertheless, the past of X is useless for predicting $Y_t$ from its past, because $Y_t$ can
already be predicted perfectly from its own past. Certainly, deterministic relations
are in general problematic for conditional independence-based causal inference
since determinism induces additional independences. For instance, if Y is a function
of X in the causal chain $X \to Y \to Z$, we get $Y \perp\!\!\!\perp Z \mid X$, which is not typical
for this causal structure. One may therefore argue that this example is artificial and
a more natural version would be a noisy copy operation. For the case where $X_t$
and $Y_t$ are binary variables, Janzing et al. [2013, Example 7] show that the transfer
entropy converges to 0 when the noise level of the copy operation tends to 0. Then,
Granger causality would indeed infer that X causes Y and Y causes X, but for small
noise the tiny amount by which the past of X improves the prediction of $Y_t$ does not
properly account for the mutual influence between the time series (which is still
strong in an intuitive sense). In this sense, transfer entropy is not an adequate measure
for the strength of causal influence of one time series on another one. Janzing
et al. [2013] discuss the limitations of different proposals to quantify causal influence
(both for time series and the i.i.d. setting) and propose another information
theoretic measure of causal strength. To summarize this paragraph, we emphasize
that the qualitative statement about presence or absence of causal influence in the
case of two causally sufficient time series only fails for a rather artificial scenario,
while quantifying the causal influence via transfer entropy (which is suggested by
interpreting “improvement of prediction” in information theoretic terms) can be
problematic also in less artificial scenarios.
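
The behavior of transfer entropy for the noisy copy operation can be explored with a simple plug-in estimator. The following sketch rests on assumptions of its own: the noise is modeled as an independent bit flip with probability `eps`, and the past of Y is truncated to two time steps and the past of X to one, which suffices for this particular model but is not a general-purpose estimator. All function names are hypothetical.

```python
import math
import random
from collections import Counter

def simulate_noisy_copy(eps, n, rng):
    """Binary series with Y_t = X_{t-1} xor E_t and X_t = Y_{t-1} xor E'_t,
    where the noise bits E_t and E'_t equal 1 with probability eps."""
    x, y = [rng.randint(0, 1)], [rng.randint(0, 1)]
    for _ in range(n - 1):
        x_new = y[-1] ^ (rng.random() < eps)
        y_new = x[-1] ^ (rng.random() < eps)
        x.append(x_new)
        y.append(y_new)
    return x, y

def transfer_entropy(x, y):
    """Plug-in estimate (in bits) of I(Y_t : X_{t-1} | Y_{t-1}, Y_{t-2})."""
    joint = Counter()                  # counts of (y_t, x_{t-1}, (y_{t-1}, y_{t-2}))
    for t in range(2, len(y)):
        joint[(y[t], x[t - 1], (y[t - 1], y[t - 2]))] += 1
    n = sum(joint.values())
    c_xp, c_yp, c_p = Counter(), Counter(), Counter()
    for (yt, xp, past), c in joint.items():
        c_xp[(xp, past)] += c
        c_yp[(yt, past)] += c
        c_p[past] += c
    te = 0.0
    for (yt, xp, past), c in joint.items():
        # ratio = p(y_t | x_past, y_past) / p(y_t | y_past)
        ratio = (c / c_xp[(xp, past)]) / (c_yp[(yt, past)] / c_p[past])
        te += (c / n) * math.log2(ratio)
    return te

rng = random.Random(0)
te_moderate_noise = transfer_entropy(*simulate_noisy_copy(0.2, 50000, rng))
te_small_noise = transfer_entropy(*simulate_noisy_copy(0.01, 50000, rng))
```

Under these assumptions the estimate for the small noise level comes out smaller than the one for the moderate noise level, although the copy mechanism, and hence the effect of an intervention on X, is stronger when the noise is small.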

There is another scenario where Granger causality is quantitatively misleading 
but its qualitative statement remains correct unless faithfulness is violated (it uses, 
however, instantaneous effects, for which one may argue that they disappear for 
sufficiently fine time resolution [Granger, 1988]). For Figure 10.8(a), d-separation
yields

$$Y_t \perp\!\!\!\perp X_{\mathrm{past}(t)} \mid Y_{\mathrm{past}(t)}.$$

Intuitively speaking, only the present value $X_t$ would help for better predicting
$Y_t$, but the past values $X_{t-1}, X_{t-2}, \dots$ are useless and thus, Granger causality does
not propose a link from X to Y. In Figure 10.8(b), however, Granger causality
does detect the influence of X on Y (if we assume faithfulness) although it is still
purely instantaneous, but the slight amount of improvement of the prediction does
not properly account for the potentially strong influence of $X_t$ on $Y_t$. To account
for instantaneous effects, modifications of Granger causality have been proposed
that add instantaneous terms in the corresponding SCM, but then identifiability
may break down [e.g., Lütkepohl, 2007, (2.3.20) and (2.3.21)]. Knowing that a
system contains instantaneous effects may suggest modifying Granger causality by
regressing $Y_t$ in (10.8) not only on $X_{\mathrm{past}(t)}$ but on $X_t \cup X_{\mathrm{past}(t)}$ instead. However, as
already noted by Granger [1988], this may yield wrong conclusions: if $X_t$ helps in
predicting $Y_t$, this could equally well mean that $Y_t$ influences $X_t$ instead of indicating
an influence from $X_t$ to $Y_t$.


Remark 10.5 (Model misspecification may help) There is a paradoxical message in
this insight: even in the case in which variables influence other variables
instantaneously, for inferring causal statements it is more conclusive to check whether the
past of a variable helps for the prediction rather than to check whether the past and
the present value help. Condition (i) of Theorem 10.4 does not exclude instantaneous
effects. Therefore (subject to causal sufficiency), we can still conclude that
every benefit of $X_{\mathrm{past}(t)}$ for predicting $Y_t$ from $Y_{\mathrm{past}(t)}$ is due to an influence of X on
Y. Moreover, whenever there is any influence of X on Y, no matter whether it is
purely instantaneous or not, $X_{\mathrm{past}(t)}$ will in the generic case improve our prediction
of $Y_t$, given $Y_{\mathrm{past}(t)}$.


10.3.4 Models with Restricted Function Classes 


To address the limitations of Granger causality, Hyvärinen et al. [2008] describe 
linear non-Gaussian autoregressive models that render causal structures with in- 
stantaneous effects identifiable. Peters et al. [2013] describe how to address this 
task using less restrictive function classes $f^j$ in (10.1). One example is given by
adapting ANMs to time series, that is, to use the SCM

$$X^j_t := f^j\big((PA^j_q)_{t-q}, \dots, (PA^j_1)_{t-1}, (PA^j_0)_t\big) + N^j_t,$$

for $j \in \{1, \dots, d\}$. Apart from identifiability of causal structures within Markov
equivalence classes, there is a second motivation for using restricted function classes:
using simulated time series, Peters et al. [2013] provide some empirical evidence 
for the belief that time series that admit models from a restricted function class are 
less likely to be confounded. 


10.3.5 Spectral Independence Criterion 


The spectral independence criterion (SIC) is a method that is based on the idea 
of independence between cause and mechanism described in Shajarisales et al. 
[2015]. Assume we are given a weakly stationary bivariate time series $(X_t, Y_t)_{t \in \mathbb{Z}}$
where either X influences Y or Y influences X via a linear time invariant filter.
More explicitly, for the case that X influences Y, Y is then obtained from X by
convolution with a function h:

$$Y_t = \sum_{k=1}^{\infty} h(k) X_{t-k}. \tag{10.10}$$

For technical details, such as the decay conditions for h that ensure that (10.10)
and expressions below are well-defined, we refer to Shajarisales et al. [2015]. To
formalize an independence condition between X and h, we consider the action of
the filter in the frequency domain: for all $\nu \in [-1/2, 1/2]$, let $S_{XX}(\nu)$ denote the
power spectral density for the frequency $\nu$; the latter is explicitly given by the
Fourier transform of the auto-covariance function of X. Then, (10.10) yields

$$S_{YY}(\nu) = |\tilde{h}(\nu)|^2 \cdot S_{XX}(\nu), \tag{10.11}$$

where $\tilde{h}(\nu) = \sum_{k \in \mathbb{Z}} e^{-i 2\pi k \nu} h(k)$ denotes the Fourier transform of h. In other words,
multiplying the power spectrum of the input time series with the squared transfer
function of the filter yields the power spectrum of the output. Whenever $\tilde{h}$ is
invertible, in addition to (10.11) we have

$$S_{XX}(\nu) = \left| \frac{1}{\tilde{h}(\nu)} \right|^2 \cdot S_{YY}(\nu). \tag{10.12}$$

While both equations (10.11) and (10.12) are valid, the question is which one de- 
scribes the causal model. The idea is that for the causal direction, the power spec- 
trum of the input time series carries no information about the transfer function of 
the filter. To formalize this, Shajarisales et al. [2015] state the following indepen- 
dence condition: 


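Definition 10.6 (SIC) The time series X and the filter h are said to satisfy the SIC
if $S_{XX}$ and $\tilde{h}$ are uncorrelated, that is,

$$\langle S_{XX} \cdot |\tilde{h}|^2 \rangle = \langle S_{XX} \rangle \cdot \langle |\tilde{h}|^2 \rangle, \tag{10.13}$$

where $\langle f \rangle := \int_{-1/2}^{1/2} f(\nu)\, d\nu$ denotes the average of any function f on the frequency
interval $[-1/2, 1/2]$.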




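Shajarisales et al. [2015] show that (10.13) implies that the analogous independence
condition for the backward direction does not hold, except for the non-generic
case where $|\tilde{h}|$ is constant over the whole interval of frequencies.


Theorem 10.7 (Identifiability via SIC) If (10.13) holds and $|\tilde{h}|$ is not constant
in $\nu$ then $S_{YY}$ is negatively correlated with $1/|\tilde{h}|^2$, that is,

$$\langle S_{YY} \cdot 1/|\tilde{h}|^2 \rangle < \langle S_{YY} \rangle \cdot \langle 1/|\tilde{h}|^2 \rangle. \tag{10.14}$$


Proof. By (10.11) and (10.12), the left-hand sides of (10.13) and (10.14) are given
by $\langle S_{YY} \rangle$ and $\langle S_{XX} \rangle$, respectively, so that (10.13) reads
$\langle S_{XX} \rangle = \langle S_{YY} \rangle / \langle |\tilde{h}|^2 \rangle$. Jensen’s inequality states
$1/\langle |\tilde{h}|^2 \rangle < \langle 1/|\tilde{h}|^2 \rangle$, which implies the statement.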


Shajarisales et al. [2015] propose a simple causal inference algorithm that checks 
which direction is closer to satisfying SIC. They report some encouraging results 
using SIC for experiments with various simulated and real-world data sets. 
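
Theorem 10.7 can be checked numerically on a frequency grid. The example below is a sketch under assumptions: the two spectra are hypothetical trigonometric functions chosen so that (10.13) holds exactly, and the averages are approximated by the midpoint rule. It does not estimate spectra from data and is not the inference algorithm of Shajarisales et al. [2015].

```python
import math

def average(f, n_grid=10000):
    """Approximate <f>, the integral of f over [-1/2, 1/2], by the midpoint rule."""
    return sum(f(-0.5 + (i + 0.5) / n_grid) for i in range(n_grid)) / n_grid

# Hypothetical spectra, chosen so that S_XX and |h~|^2 are exactly uncorrelated.
def s_xx(nu):
    return 1.0 + 0.5 * math.cos(2 * math.pi * nu)

def h_sq(nu):                          # squared transfer function |h~(nu)|^2
    return 1.0 + 0.5 * math.cos(4 * math.pi * nu)

def s_yy(nu):                          # output spectrum according to (10.11)
    return h_sq(nu) * s_xx(nu)

# Forward direction, condition (10.13): both sides agree.
forward_lhs = average(s_yy)
forward_rhs = average(s_xx) * average(h_sq)

# Backward direction, inequality (10.14): the left side is strictly smaller.
backward_lhs = average(lambda nu: s_yy(nu) / h_sq(nu))
backward_rhs = average(s_yy) * average(lambda nu: 1.0 / h_sq(nu))
```

A direction-finding procedure in the spirit of the text would compare how far the two sides are apart in each direction and prefer the direction with the smaller gap; here the forward gap vanishes while the backward gap does not.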


10.4 Dynamic Causal Modeling 


Dynamic causal modeling (DCM) is a technique that has been developed particu- 
larly for inferring causal relations between the activities of different brain regions 
[Friston et al., 2003]. If the vector $z \in \mathbb{R}^n$ encodes the activity of n brain regions
and $u \in \mathbb{R}^m$ a vector of perturbations, the dynamics of z is given by a differential
equation of the form

$$\frac{d}{dt} z = F(z, u, \theta), \tag{10.15}$$

where F is a known function, $u \in \mathbb{R}^m$ is a vector of external stimulations, and $\theta$
parametrizes the model class describing the causal links between the different brain
regions. One often considers the following bilinear approximation of (10.15):

$$\frac{d}{dt} z = \Big( A + \sum_{j=1}^{m} u_j B^j \Big) z + C u, \tag{10.16}$$

where $A, B^1, \dots, B^m$ are $n \times n$ matrices and C has the size $n \times m$. While A describes
the mutual influence of the activities $z_i$ in different regions, the matrices $B^j$ describe
how u changes their mutual influence. C encodes the direct influence of u on z.
Here, z is not directly observable, but one can detect the hemodynamic response. 
The blood flow provides an increased amount of nutrients (such as oxygen and 
glucose) to compensate for the increased demand of energy. Functional magnetic
resonance imaging (fMRI) is able to detect this increase via the blood-oxygen-level-dependent
(BOLD) signal. Defining a state vector x that includes both the
brain activity and some hemodynamic state variables, one ends up with a differential
equation for x

$$\frac{d}{dt} x = F(x, u, \theta) \tag{10.17}$$

by combining (10.16) with a dynamical model of the hemodynamic response. The
high-dimensional parameter $\theta$ consists of all free parameters of (10.16) and
parameters from modeling the hemodynamic response. Then, one uses a model of
how x determines the measured BOLD signal y:

$$y = \lambda(x). \tag{10.18}$$


Finally, as data, we obtain an observed time series of y-vectors. DCM then infers 
the matrices in (10.16) from these data using various known techniques for learning 
models with latent variables, for example, expectation maximization (EM). 
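
To make the bilinear state equation (10.16) concrete, the following sketch integrates it with the forward Euler method for a hypothetical system with n = 2 regions and m = 1 input. The matrices, the stimulus, the step size, and the function names are illustrative assumptions. The sketch covers only the neuronal state z; it contains neither the hemodynamic model (10.17), the observation model (10.18), nor any parameter inference.

```python
def bilinear_rhs(z, u, A, B, C):
    """Right-hand side of (10.16): (A + sum_j u_j B^j) z + C u."""
    n, m = len(z), len(u)
    dz = []
    for i in range(n):
        coupling = sum((A[i][k] + sum(u[j] * B[j][i][k] for j in range(m))) * z[k]
                       for k in range(n))
        drive = sum(C[i][j] * u[j] for j in range(m))
        dz.append(coupling + drive)
    return dz

def euler_integrate(z0, u_of_t, A, B, C, dt, n_steps):
    """Forward Euler integration; returns the states z(0), z(dt), z(2 dt), ..."""
    z, path = list(z0), [list(z0)]
    for step in range(n_steps):
        dz = bilinear_rhs(z, u_of_t(step * dt), A, B, C)
        z = [zi + dt * dzi for zi, dzi in zip(z, dz)]
        path.append(z)
    return path

# Hypothetical two-region example with a single input (n = 2, m = 1).
A = [[-1.0, 0.0],
     [0.5, -1.0]]           # region 1 excites region 2; both decay on their own
B = [[[0.0, 0.0],
      [0.5, 0.0]]]          # the input strengthens the connection from region 1 to 2
C = [[1.0],
     [0.0]]                 # the input drives region 1 directly

def stimulus(t):
    return [1.0] if t < 5.0 else [0.0]

path = euler_integrate([0.0, 0.0], stimulus, A, B, C, dt=0.01, n_steps=2000)
```

With these assumed values both activities approach 1 while the stimulus is on and decay back toward 0 after it is switched off.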

Lohmann et al. [2012a] criticize DCM mainly because the number of model pa- 
rameters explodes with growing n and m, which renders their identification im- 
possible from empirical data. According to their experiments with simulated brain 
connections, a large fraction of wrong models obtained higher evidence by DCM 
than the true model. These findings triggered a debate about DCM; see also Friston 
et al. [2013] for a response to Lohmann et al. [2012a] and Lohmann et al. [2012b] 
for a response to Friston et al. [2013]. 


10.5 Problems 


Problem 10.8 (Acyclic summary graphs) Prove Theorem 10.2. 


Problem 10.9 (Instantaneous effects) Consider an SCM over a multivariate time
series, in which each variable $X^j_t$ is influenced by all past values of all
components $X^k$. Additionally, assume that the instantaneous effects form a DAG and that
the distribution is Markovian and faithful with respect to the full time graph. To
what extent can one identify the instantaneous DAG structure from the distribution?


Problem 10.10 (Granger causality) Argue why Granger causality results in “X
Granger-causes Y” and “Y Granger-causes X” if one adds arrows $Z_t \to Z_{t+1}$ for $t \in \mathbb{Z}$ in
Figure 10.7(a).


Some Probability and Statistics


A.1 Basic Definitions


(i) We denote the underlying probability space by $(\Omega, \mathcal{F}, P)$. Here, $\Omega$, $\mathcal{F}$, and $P$
are set, σ-algebra, and probability measure, respectively.

(ii) We use capital letters for real-valued random variables. For example,
$X : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}_{\mathbb{R}})$ is a measurable function with respect to the Borel
σ-algebra. Random vectors are measurable functions $\mathbf{X} : (\Omega, \mathcal{F}) \to (\mathbb{R}^d, \mathcal{B}_{\mathbb{R}^d})$.
We call $\mathbf{X}$ non-degenerate if there is no value $c \in \mathbb{R}^d$ such that $P(\mathbf{X} = c) = 1$.
For an introduction to measure theory, see, for example, Dudley [2002].

(iii) We usually denote vectors with bold letters. In a slight abuse of notation, we
consider sets of variables $\mathbf{B} \subseteq \mathbf{X}$ as a single multivariate variable.

(iv) $P_{\mathbf{X}}$ is the distribution of the d-dimensional random vector $\mathbf{X}$, that is, a
probability measure on $(\mathbb{R}^d, \mathcal{B}_{\mathbb{R}^d})$.

(v) We write $x \mapsto p_X(x)$ or simply $x \mapsto p(x)$ for the density, that is, the
Radon-Nikodym derivative of $P_X$ with respect to a product measure. We (sometimes
implicitly) assume its existence or continuity.

(vi) We call X independent of Y and write $X \perp\!\!\!\perp Y$ if and only if

$$p(x, y) = p(x)\, p(y) \tag{A.1}$$

for all x, y. Otherwise, X and Y are dependent, and we write $X \not\perp\!\!\!\perp Y$.

(vii) We call $X_1, \dots, X_d$ jointly (or mutually) independent if and only if

$$p(x_1, \dots, x_d) = p(x_1) \cdot \ldots \cdot p(x_d) \tag{A.2}$$

for all $x_1, \dots, x_d$. If $X_1, \dots, X_d$ are jointly independent, then any pair $X_i$ and
$X_j$ with $i \neq j$ are independent, too. The converse does not hold in general:
pairwise independence does not imply joint independence.

(viii) We call X independent of Y conditional on Z and write $X \perp\!\!\!\perp Y \mid Z$ if and
only if

$$p(x, y \mid z) = p(x \mid z)\, p(y \mid z) \tag{A.3}$$

for all x, y, z such that $p(z) > 0$. Otherwise, X and Y are dependent conditional
on Z and we write $X \not\perp\!\!\!\perp Y \mid Z$.

(ix) Conditional independence relations obey the following important rules [e.g.,
Pearl, 2009, Section 1.1.5]:

$X \perp\!\!\!\perp Y \mid Z \;\Rightarrow\; Y \perp\!\!\!\perp X \mid Z$ (symmetry)
$X \perp\!\!\!\perp Y, W \mid Z \;\Rightarrow\; X \perp\!\!\!\perp Y \mid Z$ (decomposition)
$X \perp\!\!\!\perp Y, W \mid Z \;\Rightarrow\; X \perp\!\!\!\perp Y \mid W, Z$ (weak union)
$X \perp\!\!\!\perp Y \mid Z$ and $X \perp\!\!\!\perp W \mid Y, Z \;\Rightarrow\; X \perp\!\!\!\perp Y, W \mid Z$ (contraction)
$X \perp\!\!\!\perp Y \mid W, Z$ and $X \perp\!\!\!\perp W \mid Y, Z \;\Rightarrow\; X \perp\!\!\!\perp Y, W \mid Z$ (intersection).

The existence of a strictly positive density suffices for the intersection property
to hold. Necessary and sufficient conditions for the discrete case are
provided by Drton et al. [2009b, Exercise 6.6] and by Fink [2011]. Peters
[2014] covers the continuous case.

(x) The variance of a random variable X is defined as

$$\operatorname{var}[X] := E\big[(X - E[X])^2\big] = E[X^2] - E[X]^2$$

if $E[X^2] < \infty$.

(xi) We call X and Y uncorrelated if $E[X^2], E[Y^2] < \infty$ and

$$E[XY] = E[X]\, E[Y],$$

that is,

$$\rho_{X,Y} := \frac{E[XY] - E[X]\, E[Y]}{\sqrt{\operatorname{var}[X] \operatorname{var}[Y]}} = 0.$$

Otherwise, that is, if $\rho_{X,Y} \neq 0$, X and Y are correlated. $\rho_{X,Y}$ is called the
correlation coefficient between X and Y.

(xii) If X and Y are independent, then they are uncorrelated:

$$X \perp\!\!\!\perp Y \;\Rightarrow\; \rho_{X,Y} = 0.$$

The other direction does not necessarily hold (see Code Snippet A.1). Only
in special cases, such as the bivariate Gaussian distribution or binary variables,
does the reversed direction hold, too.


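(xiii) We say that X and Y are partially uncorrelated given Z if

$$\rho_{X,Y|Z} := \frac{\rho_{X,Y} - \rho_{X,Z}\, \rho_{Z,Y}}{\sqrt{(1 - \rho_{X,Z}^2)(1 - \rho_{Z,Y}^2)}} = 0.$$

The following interpretation of partial correlation is important: $\rho_{X,Y|Z}$ equals
the correlation between residuals after linearly regressing X on Z and Y on Z.


(xiv) In general, we have (see Example 7.9)

$$\rho_{X,Y|Z} = 0 \;\not\Rightarrow\; X \perp\!\!\!\perp Y \mid Z \quad \text{and} \quad \rho_{X,Y|Z} = 0 \;\not\Leftarrow\; X \perp\!\!\!\perp Y \mid Z.$$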


(xv) In regression estimation, we are usually given an i.i.d. sample $(X_1, Y_1), \dots,
(X_n, Y_n)$ from a joint distribution $P_{X,Y}$. Our aim is to predict the target Y from
the covariates or predictors X. In least squares regression, for example, we
are looking for a function $\hat{f}$ such that

$$\hat{f} = \operatorname*{argmin}_{f \in \mathcal{F}} \sum_{i=1}^{n} \big(Y_i - f(X_i)\big)^2.$$

Here, we optimize over a function class F (see Section A.3). Different re- 
gression techniques use different function classes. In linear regression, we 
are only considering linear functions f; see Code Snippet 6.43 for an exam- 
ple. Code Snippet 4.14 shows an example for a nonlinear regression tech- 
nique. 

(xvi) Dependence between sets of discrete random variables X and Y can be measured
by the Shannon mutual information [Cover and Thomas, 1991]

$$I(X : Y) := \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}.$$


(xvii) Conditional dependence of sets of discrete random variables X and Y, given
the set Z, is measured via the conditional Shannon mutual information
[Cover and Thomas, 1991]

$$I(X : Y \mid Z) := \sum_{x,y,z} p(x, y, z) \log \frac{p(x, y \mid z)}{p(x \mid z)\, p(y \mid z)}.$$

(xviii) For continuous variables, the sums are replaced with integrals

$$I(X : Y) := \int p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx\, dy$$

and

$$I(X : Y \mid Z) := \int p(x, y, z) \log \frac{p(x, y \mid z)}{p(x \mid z)\, p(y \mid z)} \, dx\, dy\, dz.$$
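
Items (xii) and (xiii) of the list above can be illustrated with simulated data. The snippet is a sketch using only the Python standard library; the sample size, the distributions, and all function names are assumptions made for the illustration. The first part shows a pair of dependent but uncorrelated variables, the second part compares the partial correlation formula of item (xiii) with the correlation of regression residuals.

```python
import math
import random

def mean(v):
    return sum(v) / len(v)

def correlation(u, v):
    """Empirical correlation coefficient."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def residuals(target, z):
    """Residuals of a least-squares linear regression (with intercept) of target on z."""
    mt, mz = mean(target), mean(z)
    slope = (sum((a - mt) * (b - mz) for a, b in zip(target, z))
             / sum((b - mz) ** 2 for b in z))
    return [a - mt - slope * (b - mz) for a, b in zip(target, z)]

rng = random.Random(0)
n = 20000

# Item (xii): Y = X^2 is a function of X, yet X and Y are uncorrelated.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [a * a for a in x]
rho_xy = correlation(x, y)                       # close to 0

# Item (xiii): partial correlation via the formula and via regression residuals.
z = [rng.gauss(0.0, 1.0) for _ in range(n)]
v = [c + rng.gauss(0.0, 1.0) for c in z]         # V depends on Z
w = [c + rng.gauss(0.0, 1.0) for c in z]         # W depends on Z
r_vw, r_vz, r_zw = correlation(v, w), correlation(v, z), correlation(z, w)
partial_formula = (r_vw - r_vz * r_zw) / math.sqrt((1 - r_vz ** 2) * (1 - r_zw ** 2))
partial_residuals = correlation(residuals(v, z), residuals(w, z))
```

In the second part V and W are correlated through Z, while the two versions of the partial correlation agree up to floating-point error and are close to zero.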


A.2 Independence and Conditional Independence Testing 


In practice, we are given a finite sample (X1,Y1),...,(Xn,Yn) Px y and want to 
decide whether the underlying random variables are independent or not. Since 
we do not expect the empirical correlation (or any independence measure) to be 
exactly 0, we need to take into account random fluctuations of the dependence 
measures. This can be done by statistical hypothesis tests. The idea is to consider 
the null hypothesis Hp : X IL Y and the alternative H4 : X K Y. Therefore, one 
usually constructs a test statistic 7,, that maps any finite sample to a real number, 
and one decides according to 


Ho ifT,<c 
(1191), -e nYa) > { H Wo: 


Here, $T_n$ is shorthand notation for $T_n((x_1,y_1),\ldots,(x_n,y_n))$. The threshold $c \in \mathbb{R}$ is
chosen such that we can control the type I error; that is, for any $P$ satisfying $H_0$,
we have $P(T_n > c) \le \alpha$, where $\alpha$ is the significance level of the test, specified by
the user. In practice, we are given data and compute the statistic $T_n$. If $T_n > c$, the
null hypothesis is rejected, and we can be relatively confident that our decision is
correct; otherwise, the null hypothesis is not rejected, which does not necessarily
mean much (it could be that the sample size $n$ was too small to detect the dependence
between X and Y). The p-value of a test is the smallest significance level
at which the null hypothesis is rejected.
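To make the roles of the statistic, the threshold, and the p-value concrete, here is a minimal sketch of a permutation test. This is a swapped-in technique chosen for illustration and is not one of the tests discussed in this section: the null distribution of $T_n$ is approximated by recomputing the statistic after randomly permuting the second sample. The statistic (absolute empirical correlation), the function names, and all parameter values are assumptions of this sketch.

```python
import random

def abs_correlation(xs, ys):
    """Test statistic T_n: absolute empirical correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return abs(sxy) / (sxx * syy) ** 0.5

def permutation_p_value(xs, ys, statistic=abs_correlation, n_perm=999, seed=0):
    """Approximate p-value: share of permuted samples with T_n >= observed T_n."""
    rng = random.Random(seed)
    t_obs = statistic(xs, ys)
    ys_perm = list(ys)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(ys_perm)          # permuting ys breaks any dependence on xs
        if statistic(xs, ys_perm) >= t_obs:
            exceed += 1
    # The "+1" counts the observed sample itself and avoids a p-value of 0.
    return (1 + exceed) / (1 + n_perm)

# Strongly dependent toy data: y is x plus a small amount of noise.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(50)]
y_dep = [xi + 0.1 * rng.gauss(0, 1) for xi in x]

alpha = 0.05                          # significance level chosen by the user
p_dep = permutation_p_value(x, y_dep)
reject = p_dep <= alpha               # reject H_0 iff the p-value is at most alpha
```

Because this statistic only measures linear dependence, the sketch would fail to reject for data that are uncorrelated but dependent; a different statistic can be passed via the `statistic` argument.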

We now briefly mention a couple of choices for $T_n$. There are many more tests,
however, and we do not claim that the list contains optimal procedures; see Code 
Snippet A.1 for a practical example. 


(i) To test for vanishing correlation, we can use the empirical correlation co- 
efficient and a t-test (for Gaussian variables) or Fisher’s z-transform (e.g., 
cor.test in R Core Team [2016]). 




(ii) As an independence test, we may use a $\chi^2$-test for discrete or discretized
data (e.g., chisq.test in R Core Team [2016]). 


(iii) An example for a general non-parametric independence test is the
Hilbert-Schmidt Independence Criterion (HSIC) [see Gretton et al., 2008]. Its
idea is based on an injective mapping into reproducing kernel Hilbert spaces
(RKHSs) [Schölkopf and Smola, 2002]. Given a positive definite kernel, we
can map probability distributions into the corresponding RKHS $\mathcal{H}$, that is,
$P_{X,Y} \mapsto \mu(P_{X,Y}) \in \mathcal{H}$. For so-called characteristic kernels (e.g., the Gaussian
kernel), this mapping is injective. In particular, we then have


$$\mu(P_{X,Y}) = \mu(P_X \otimes P_Y) \quad \text{if and only if} \quad P_{X,Y} = P_X \otimes P_Y,$$


and the latter holds if and only if X and Y are independent. The HSIC is
defined as the squared RKHS-distance between the joint distribution and the
product of marginals:


$$\mathrm{HSIC}(P_{X,Y}) := \bigl\| \mu(P_{X,Y}) - \mu(P_X \otimes P_Y) \bigr\|^2_{\mathcal{H}}.$$


As test statistic $T_n$ we can now use an estimator for $\mathrm{HSIC}(P_{X,Y})$. If X and Y
are independent, $\mathrm{HSIC}(P_{X,Y})$ equals 0, and we expect its estimator $T_n$ to be
small. Gretton et al. [2008] describe how to choose the threshold $c$.

Alternatively, we can express HSIC as the Hilbert-Schmidt norm of the
cross-covariance operator $C_{XY}$. The latter is defined such that for all $f$ and $g$ that
are members of the corresponding RKHSs


$$\langle f, C_{XY}\, g \rangle = E[f(X)\,g(Y)] - E[f(X)]\,E[g(Y)].$$


The cross-covariance operator is therefore an extension of the covariance
matrix. If X is $d_X$-dimensional, Y is $d_Y$-dimensional, and the corresponding
RKHSs are isomorphic to $\mathbb{R}^{d_X}$ and $\mathbb{R}^{d_Y}$, respectively, $C_{XY}$ can be described
with the $d_X \times d_Y$-dimensional cross-covariance matrix. Certainly, X and Y
do not need to be independent if this covariance matrix vanishes. For
characteristic kernels, however, the RKHSs are infinite-dimensional and not
isomorphic to $\mathbb{R}^d$. The cross-covariance operator then has zero norm if and only
if X and Y are independent.

Pfister et al. [2017] extend the procedure to test for joint independence
between d variables. This is necessary to test for joint independence of noise
variables, for example. They provide code for both the bivariate and the
multivariate procedure (see the R-package dHSIC).

In practice, one usually needs to choose kernel parameters. For the Gaussian
kernel, many implementations choose the bandwidth $\sigma$ according to the
so-called median heuristic [e.g., Gretton et al., 2008].


(iv) Conditional independence testing. Conditional independence testing is a
hard problem, especially if the conditioning set is large. While it is current
research to obtain a precise formalization for this statement, we provide an
example that indicates the hardness of the problem. If $Z_1,\ldots,Z_d$ are binary
variables, we have that


$$X \perp\!\!\!\perp Y \mid Z_1,\ldots,Z_d \iff \forall z_1,\ldots,z_d \in \{0,1\} : \; X \perp\!\!\!\perp Y \mid Z_1 = z_1,\ldots,Z_d = z_d.$$


If we cannot assume anything on the way X and Y may depend on the Z's,
we need to perform an unconditional independence test for each of the $2^d$
assignments (e.g., $Z_d$ could be a common child of X and Y with the dependence
only detectable for a specific assignment of the other $Z_1,\ldots,Z_{d-1}$).

For continuous variables, extensions of the HSIC test have been proposed. 
Fukumizu et al. [2008] extend the idea to conditional cross-covariance oper- 
ators to obtain a conditional independence test. This is developed further by 
Zhang et al. [2011], who additionally provide an approximation of the test 
statistic’s distribution under the null hypothesis. 
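The HSIC statistic from item (iii) can be estimated from Gram matrices. The following is a minimal sketch of the biased empirical estimator $\frac{1}{n^2}\operatorname{tr}(KHLH)$ with centering matrix $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, for one-dimensional samples and a Gaussian kernel. The bandwidth rule used here (median of the pairwise distances) is one common form of the median heuristic and is an assumption of this sketch; it is not claimed to match the convention of the dHSIC package, and the sketch omits the choice of the threshold $c$.

```python
import math
import random

def gaussian_gram(values, bandwidth):
    """Gram matrix K[i][j] = exp(-(v_i - v_j)^2 / (2 * bandwidth^2))."""
    return [[math.exp(-(a - b) ** 2 / (2 * bandwidth ** 2)) for b in values]
            for a in values]

def median_bandwidth(values):
    """Median of pairwise distances (assumes the sample is not constant)."""
    dists = sorted(abs(a - b) for i, a in enumerate(values)
                   for b in values[i + 1:])
    return dists[len(dists) // 2]

def center(gram):
    """Compute H K H by subtracting row means, column means, adding grand mean."""
    n = len(gram)
    row = [sum(r) / n for r in gram]      # equals the column means (symmetry)
    total = sum(row) / n
    return [[gram[i][j] - row[i] - row[j] + total for j in range(n)]
            for i in range(n)]

def hsic_biased(xs, ys):
    """Biased HSIC estimator tr(K H L H) / n^2 for one-dimensional samples."""
    n = len(xs)
    k = center(gaussian_gram(xs, median_bandwidth(xs)))
    l = gaussian_gram(ys, median_bandwidth(ys))
    return sum(k[i][j] * l[i][j] for i in range(n) for j in range(n)) / n ** 2

# Uncorrelated but dependent data: two independent uniforms rotated by 45 degrees.
rng = random.Random(1)
a = [rng.random() - 0.5 for _ in range(200)]
b = [rng.random() - 0.5 for _ in range(200)]
c, s = math.cos(math.pi / 4), math.sin(math.pi / 4)
xs = [c * ai - s * bi for ai, bi in zip(a, b)]
ys = [s * ai + c * bi for ai, bi in zip(a, b)]

hsic_dependent = hsic_biased(xs, ys)
ys_shuffled = list(ys)
rng.shuffle(ys_shuffled)                  # shuffling mimics the null hypothesis
hsic_shuffled = hsic_biased(xs, ys_shuffled)
```

Comparing `hsic_dependent` with values obtained from many shuffles would give a permutation-based threshold; this is only one of several possible ways to calibrate the test.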


Code Snippet A.1 The following code generates a sample of a distribution over
two variables that are uncorrelated but dependent.


library(dHSIC)

# generate a sample from two uncorrelated but dependent random variables
# by rotating two independent uniform variables by 45 degrees
set.seed(1)
A <- runif(200) - 0.5
B <- runif(200) - 0.5
X <- t( c(cos(pi/4), -sin(pi/4)) %*% rbind(A, B) )
Y <- t( c(sin(pi/4), cos(pi/4)) %*% rbind(A, B) )

# perform the statistical tests
cor.test(X, Y)$p.value
# 0.3979561
dhsic.test(X, Y)$p.value
# 1.970705e-08


A.3 Capacity of Function Classes


Here, we address the question whether the sequence of functions minimizing the
empirical risk (1.3) converges to a function that also minimizes the risk (1.2);
see Section 1.2. By the law of large numbers, we know that for any fixed $f \in \mathcal{F}$
and $\varepsilon > 0$,

$$\lim_{n \to \infty} P\bigl( |R[f] - R^n_{\mathrm{emp}}[f]| > \varepsilon \bigr) = 0, \qquad \text{(A.4)}$$

with exponentially fast convergence governed by Chernoff's bound [e.g., Vapnik,
1998]. However, this does not imply consistency of empirical risk minimization.
This is due to the fact that we are choosing the function $f$ by minimizing (1.3).
This implies that even though the $(x_i,y_i)$ are independent, the errors or losses
$\frac{1}{2}|f(x_i) - y_i|$ are not. In this case, the law of large numbers in its usual form does
not apply. It turns out that to get consistency, we need a uniform law of large
numbers [Vapnik, 1998]. This amounts to

$$\lim_{n \to \infty} P\Bigl( \sup_{f \in \mathcal{F}} \bigl( R[f] - R^n_{\mathrm{emp}}[f] \bigr) > \varepsilon \Bigr) = 0 \qquad \text{(A.5)}$$

for all $\varepsilon > 0$, a property that depends on the function class $\mathcal{F}$.

How about choosing $\mathcal{F} = \mathcal{Y}^{\mathcal{X}}$, in other words, all functions from $\mathcal{X}$ to $\mathcal{Y}$?
Unfortunately, this does not lead to (A.5), and the reasoning is as follows: Suppose
that based on the available sample (1.1), we decide that $f^*$ is a good solution, for
instance, since it satisfies $f^*(x_i) = y_i$ for all $i$. In this case, let us construct another
function $f^{**}$ that agrees with $f^*$ on the sample and disagrees everywhere else. If
our distribution $P_{X,Y}$ possesses a density, then the probability of encountering any
of the training points exactly again in the future is zero. As a consequence, $f^*$ and
$f^{**}$ will almost always disagree. Based on the training set alone, however, there is
no way to choose one over the other. Similarly, in (A.5) we would find that whenever
we have found a function $f^*$ for which $(R[f^*] - R^n_{\mathrm{emp}}[f^*])$ happens to be small,
we can construct another function $f^{**}$ for which $(R[f^{**}] - R^n_{\mathrm{emp}}[f^{**}])$ is large, so
uniform convergence (A.5) is impossible to achieve in our considered case where
$\mathcal{F} = \mathcal{Y}^{\mathcal{X}}$.

On the other hand, the condition (A.5) becomes weaker as we make $\mathcal{F}$ smaller.
How one measures the size (or capacity) of $\mathcal{F}$ is beyond the scope of this book,
but it turns out that for a summary of the size of $\mathcal{F}$ irrespective of the underlying
distribution, a single number is enough. It is referred to as the VC
(Vapnik-Chervonenkis) dimension of $\mathcal{F}$. It sometimes coincides with the number of free
parameters, but it can also be vastly different. If the VC dimension is finite, we
get consistency of empirical risk minimization for any $P_{X,Y}$ [Vapnik, 1998]. The
VC dimension is related to falsifiability and Popper's notion of the dimension of
a theory [Corfield et al., 2009]. A typical risk bound of statistical learning theory
states that for all $\delta > 0$, with probability $1 - \delta$ and for all $f \in \mathcal{F}$, we have


$$R[f] \le R^n_{\mathrm{emp}}[f] + \sqrt{\frac{h\bigl(\log\frac{2n}{h} + 1\bigr) - \log(\delta/4)}{n}}, \qquad \text{(A.6)}$$


where $h$ is the VC dimension of the function class $\mathcal{F}$. This means that if we can
come up with an $\mathcal{F}$ that has small VC dimension yet contains functions that are
sufficiently suitable for the given task to achieve a small $R^n_{\mathrm{emp}}[f]$, then we can
guarantee (with high probability) that those functions have small expected error on
future data from the same distribution. This formulates a non-trivial trade-off: on
the one hand, we would like to work with a large class of functions to allow for a
small $R^n_{\mathrm{emp}}$, but on the other hand, we want the class to be small to control $h$.
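The trade-off in the risk bound can be made tangible by evaluating its capacity term numerically. The following is a minimal sketch, assuming the bound has the form $R[f] \le R^n_{\mathrm{emp}}[f] + \sqrt{(h(\log(2n/h)+1) - \log(\delta/4))/n}$ with natural logarithms; the function names and the example values of $n$, $h$, and $\delta$ are illustrative choices.

```python
import math

def vc_confidence_term(n, h, delta):
    """Capacity term sqrt((h (log(2n/h) + 1) - log(delta/4)) / n)."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(delta / 4)) / n)

def risk_bound(empirical_risk, n, h, delta=0.05):
    """Upper bound on the risk that holds with probability 1 - delta."""
    return empirical_risk + vc_confidence_term(n, h, delta)

# More data tightens the bound, a richer function class loosens it.
term_small_sample = vc_confidence_term(n=1_000, h=10, delta=0.05)
term_large_sample = vc_confidence_term(n=100_000, h=10, delta=0.05)
term_rich_class = vc_confidence_term(n=1_000, h=100, delta=0.05)
```

For a fixed sample size, a larger class may lower the empirical risk but raises the capacity term, which is exactly the trade-off described above.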


B 


Causal Orderings and Adjacency Matrices 


Definition B.1 Given a DAG $\mathcal{G}$, we call a permutation, that is, a bijective mapping,
$$\pi : \{1,\ldots,p\} \to \{1,\ldots,p\},$$

a causal ordering (sometimes one says topological ordering) if it satisfies
$$\pi(i) < \pi(j) \quad \text{if } j \in \mathbf{DE}^{\mathcal{G}}_i.$$


Because of the acyclic structure of the DAG, there is always a topological
ordering (see Proposition B.2). But this order does not have to be unique. The node
$\pi^{-1}(1)$ does not have any parents and is therefore a source node, and $\pi^{-1}(p)$ does
not have any descendants and is thus a sink node.


Proposition B.2 For each DAG there is a topological ordering. 


Proof. We proceed by induction on the number of nodes. We need to show that in
each DAG, there is a node without any ancestors. Start with any node and move to
one of its parents (if there are any). You will never visit a parent that you have seen
before (if you did, there would be a directed cycle). After at most $p - 1$ steps you
reach a node without any parent. Placing this node first and applying the induction
hypothesis to the DAG over the remaining $p - 1$ nodes yields a topological ordering.
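The argument in the proof translates into a simple procedure: repeatedly pick a node without remaining parents, append it to the ordering, and delete it from the graph. The following is a minimal sketch, assuming the DAG is given as a dictionary that maps every node to the set of its parents; the function name `causal_ordering` and the example graphs are illustrative.

```python
def causal_ordering(parents):
    """Return the nodes in a causal (topological) order, sources first.

    `parents` maps every node to the set of its parents. Node labels are
    assumed to be mutually comparable so that ties are broken deterministically.
    """
    remaining = {node: set(pa) for node, pa in parents.items()}
    order = []
    while remaining:
        sources = sorted(node for node, pa in remaining.items() if not pa)
        if not sources:
            # Every remaining node has a parent, so there is a directed cycle.
            raise ValueError("graph contains a directed cycle")
        node = sources[0]
        order.append(node)
        del remaining[node]
        for pa in remaining.values():
            pa.discard(node)          # remove the chosen node from the graph
    return order

# DAG with edges 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 4.
dag = {1: set(), 2: {1}, 3: {1, 2}, 4: {3}}
order = causal_ordering(dag)
# Position of each node in the ordering, i.e. the permutation pi.
pi = {node: position + 1 for position, node in enumerate(order)}
```

The ordering is generally not unique; the sketch breaks ties by sorting the node labels, which is an arbitrary choice.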


Definition B.3 We can represent a directed graph $\mathcal{G} = (V,E)$ over $d$ nodes with a
binary $d \times d$ matrix $A$ (taking values 0 or 1):


$$A_{i,j} = 1 \iff (i,j) \in E.$$


$A$ is called the adjacency matrix of $\mathcal{G}$.

This representation of DAGs is particularly useful for the efficient implementation
of algorithms. There are a couple of useful results transforming adjacency
matrices, some of which we report here.


Remark B.4 (i) Let $A$ be the adjacency matrix for DAG $\mathcal{G}$. The entry $(i, j)$ of
the squared matrix $A^2$ equals the number of paths of length two from $i$ to $j$.
This is because


$$(A^2)_{i,j} = \sum_{k} A_{i,k} A_{k,j}.$$

(ii) In general, we have
$$(A^k)_{i,j} = \#\,\text{paths of length } k \text{ from } i \text{ to } j.$$


(iii) If indices increase on directed paths, that is, $j \in \mathbf{DE}^{\mathcal{G}}_i$ implies $j > i$, then the
identity is a causal ordering and the adjacency matrix is upper triangular, that
is, only the upper-right half of the matrix contains non-zeros.


(iv) We may want to use sparse matrices when the graph is sparse to save space 
and/or computation time. 
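The statements of Remark B.4 can be checked on a small example. The following is a minimal sketch using nested Python lists as dense matrices; the helper names `mat_mul`, `count_paths`, and `is_upper_triangular` and the example graph are illustrative, and no sparse representation is attempted.

```python
def mat_mul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def count_paths(adj, k):
    """Return A^k, whose entry (i, j) counts directed paths of length k."""
    n = len(adj)
    result = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(k):
        result = mat_mul(result, adj)
    return result

def is_upper_triangular(adj):
    """True if all entries on and below the diagonal are zero."""
    return all(adj[i][j] == 0 for i in range(len(adj)) for j in range(i + 1))

# DAG with edges 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 3 (indices increase along edges).
A = [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]

paths_2 = count_paths(A, 2)   # e.g. 0 -> 1 -> 2 and 0 -> 2 -> 3
paths_3 = count_paths(A, 3)   # only 0 -> 1 -> 2 -> 3
paths_4 = count_paths(A, 4)   # a DAG on 4 nodes has no path of length 4
```

In a DAG with $d$ nodes no directed path can have length $d$ or more, so $A^d$ is the zero matrix; this gives a simple acyclicity check.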


The number of DAGs with d nodes has been studied by Robinson [1970, 1973]
and independently by Stanley [1973]. The number of such matrices (or DAGs)
grows very quickly in d (see Table B.1).

McKay [2004] proves the following equivalent description of DAGs which had 
been conjectured by Eric W. Weisstein. 


Theorem B.5 The matrix $A$ is an adjacency matrix of a DAG $\mathcal{G}$ if and only if $A + \mathrm{Id}$
is a 0-1 matrix with all eigenvalues being real and strictly greater than zero.

 d   Number of DAGs with d nodes

 1   1
 2   3
 3   25
 4   543
 5   29281
 6   3781503
 7   1138779265
 8   783702329343
 9   1213442454842881
10   4175098976430598143
11   31603459396418917607425
12   521939651343829405020504063
13   18676600744432035186664816926721
14   1439428141044398334941790719839535103
15   237725265553410354992180218286376719253505
16   83756670773733320287699303047996412235223138303
17   62707921196923889899446452602494921906963551482675201
18   99421195322159515895228914592354524516555026878588305014783
19   332771901227107591736177573311261125883583076258421902583546773505


Table B.1: The number of DAGs depending on the number d of nodes, taken from
http://oeis.org/A003024 [OEIS Foundation Inc., 2017]. The length of the numbers grows
faster than any linear term.
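The entries of Table B.1 can be reproduced with a short recurrence. The following is a minimal sketch, assuming the standard inclusion-exclusion recurrence for labeled DAGs that is usually attributed to Robinson, $a_n = \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} 2^{k(n-k)} a_{n-k}$ with $a_0 = 1$; the function name `count_dags` is illustrative.

```python
from math import comb

def count_dags(d_max):
    """Return [a_0, ..., a_{d_max}], where a_d is the number of DAGs on d labeled nodes."""
    a = [1]  # a_0 = 1: the empty graph
    for n in range(1, d_max + 1):
        # k is the number of nodes without incoming edges; inclusion-exclusion
        # over these source sets gives the alternating sum.
        a.append(sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * a[n - k]
                     for k in range(1, n + 1)))
    return a

dag_counts = count_dags(19)
```

Python integers have arbitrary precision, so the large values in the lower rows of the table are computed exactly.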


C 


Proofs 


C.1 Proof of Theorem 4.2 


We first state a lemma; its proof can be found in Peters [2008], for example. 


Lemma C.1 Let $X$ and $N$ be independent variables and assume that $N$ is
non-deterministic. Then $N \not\perp\!\!\!\perp (X + N)$.

Proof of Theorem 4.2. If $X$ and $N_Y$ are normally distributed, we can set
$$\beta := \frac{\operatorname{cov}[X,Y]}{\operatorname{cov}[Y,Y]} = \frac{\alpha \operatorname{var}[X]}{\alpha^2 \operatorname{var}[X] + \operatorname{var}[N_Y]}$$
and define $N_X := X - \beta Y$. $N_X$ and $Y$ are uncorrelated by construction and because
$N_X$ and $Y$ are jointly Gaussian, it follows that they are independent, too.
To prove the “only if” statement, we assume that
$$Y = \alpha X + N_Y \quad \text{and} \quad N_X = (1 - \alpha\beta) X - \beta N_Y$$


are independent. Distinguish between the following cases:


(i) $(1 - \alpha\beta) \neq 0$ and $\beta \neq 0$.
Here, Theorem 4.3 implies that $X, N_Y$ and thus also $Y, N_X$ are normally
distributed. Hence, $P_{X,Y}$ is bivariate Gaussian, too.


(ii) $\beta = 0$.
This implies
$$X \perp\!\!\!\perp \alpha X + N_Y,$$


which is a contradiction to Lemma C.1.

(iii) $(1 - \alpha\beta) = 0$.
It follows $-\beta N_Y \perp\!\!\!\perp \alpha X + N_Y$. Thus


$$N_Y \perp\!\!\!\perp \alpha X + N_Y,$$


which, again, contradicts Lemma C.1.


This concludes the proof. 


C.2 Proof of Proposition 6.3 


Proof. Recall that our definition of an SCM includes the requirement that the 
underlying graph is acyclic. We can now substitute the structural assignments
recursively into each other and can therefore write each node $X_j$ as a unique function
of all noise terms $(N_k)_{k \in \mathbf{AN}_j}$ that belong to the ancestors of $X_j$. That is,


$$X_j := g_j\bigl( (N_k)_{k \in \mathbf{AN}_j} \bigr).$$


(The function does not necessarily depend on the noise terms of all ancestors.) 


C.3 Proof of Remark 6.6 


Proof. We will show that whenever we can remove a variable from $\mathbf{PA}_j$, we can
still remove it from $\mathbf{PA}^*_j$ in the reduced model.

Consider an input $X_k \in \mathbf{PA}_j \cap \mathbf{PA}^*_j$ that $f_j$ does not depend on. That is, we have
$f_j(pa_{j,-k}, x_k, n_j) = f_j(pa_{j,-k}, x'_k, n_j)$ for all $x_k$, $x'_k$, $pa_{j,-k}$ and $n_j$ with $p(n_j) > 0$.
Here, $\mathbf{PA}_{j,-k} := \mathbf{PA}_j \setminus \{k\}$ denotes the set of all input variables except for $k$. Then, $g$
does not depend on this variable $x_k$ either because $g(pa^*_{j,-k}, x_k, n_j) = f_j(pa_{j,-k}, x_k, n_j)$
for all $x_k$, $pa^*_{j,-k}$ and $n_j$ with $p(n_j) > 0$.


C.4 Proof of Proposition 6.13 


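Proof. To simplify notation we write $X_1$ instead of $X$ and $X_2$ instead of $Y$. First,
the truncated factorization formula (6.9) implies

$$\begin{aligned}
p^{\mathfrak{C};\,do(X_1 := x_1)}(x_2) &= \int \prod_{j \neq 1} p_j(x_j \mid x_{pa(j)}) \, dx_3 \cdots dx_d \\
&= \int \prod_{j \neq 1} p_j(x_j \mid x_{pa(j)}) \, \frac{\tilde p(x_1)}{\tilde p(x_1)} \, dx_3 \cdots dx_d \\
&= p^{\mathfrak{C};\,do(X_1 := \tilde N_1)}(x_2 \mid x_1) \qquad \text{(C.1)}
\end{aligned}$$

if $\tilde N_1$ puts positive mass on $x_1$, that is, $\tilde p(x_1) > 0$. We furthermore require that
the following two statements hold for all distributions $Q_{X_1,X_2}$ over $(X_1,X_2)$ with
density $q$:

$$X_2 \not\perp\!\!\!\perp X_1 \text{ in } Q \iff \exists\, x_1^{\triangle}, x_1^{\square} \text{ with } q(x_1^{\triangle}), q(x_1^{\square}) > 0 \text{ and } Q_{X_2 \mid X_1 = x_1^{\triangle}} \neq Q_{X_2 \mid X_1 = x_1^{\square}} \qquad \text{(C.2)}$$

and

$$X_2 \not\perp\!\!\!\perp X_1 \text{ in } Q \iff \exists\, x_1^{\triangle} \text{ with } q(x_1^{\triangle}) > 0 \text{ and } Q_{X_2 \mid X_1 = x_1^{\triangle}} \neq Q_{X_2}. \qquad \text{(C.3)}$$

We then have for any $\tilde N_1$ with full support (writing $\hat N_1$ for the noise variable from (i))

$$\begin{aligned}
\text{(i)} \;&\overset{\text{(C.2)}}{\Longrightarrow}\; \exists\, x_1^{\triangle}, x_1^{\square} \text{ with pos. density under } \hat N_1 \text{ s.t. } P^{\mathfrak{C};\,do(X_1 := \hat N_1)}_{X_2 \mid X_1 = x_1^{\triangle}} \neq P^{\mathfrak{C};\,do(X_1 := \hat N_1)}_{X_2 \mid X_1 = x_1^{\square}} \\
&\overset{\text{(C.1)}}{\Longrightarrow}\; \text{(ii)} \\
&\overset{\text{(C.1)}}{\Longrightarrow}\; \exists\, x_1^{\triangle}, x_1^{\square} \text{ with pos. density under } \tilde N_1 \text{ s.t. } P^{\mathfrak{C};\,do(X_1 := \tilde N_1)}_{X_2 \mid X_1 = x_1^{\triangle}} \neq P^{\mathfrak{C};\,do(X_1 := \tilde N_1)}_{X_2 \mid X_1 = x_1^{\square}} \\
&\overset{\text{(C.2)}}{\Longrightarrow}\; \text{(iv)} \\
&\overset{\text{(trivial)}}{\Longrightarrow}\; \text{(i)}.
\end{aligned}$$

We further have (ii) $\Rightarrow$ (iii) and that $P^{\mathfrak{C};\,do(X_1 := N_1^*)}_{X_2} = P^{\mathfrak{C}}_{X_2}$ with $N_1^*$ having the
distribution $P^{\mathfrak{C}}_{X_1}$. Together with $\neg$(i) $\Rightarrow$ $\neg$(ii), the latter implies

$$\begin{aligned}
\neg\text{(i)} \;&\Longrightarrow\; X_2 \perp\!\!\!\perp X_1 \text{ in } P^{\mathfrak{C};\,do(X_1 := N_1^*)}_{X} \\
&\overset{\text{(C.3)}}{\Longrightarrow}\; P^{\mathfrak{C};\,do(X_1 := N_1^*)}_{X_2 \mid X_1 = x^{\triangle}} = P^{\mathfrak{C};\,do(X_1 := N_1^*)}_{X_2} \text{ for all } x^{\triangle} \text{ with } p_1(x^{\triangle}) > 0 \\
&\overset{\text{(C.1)}}{\Longrightarrow}\; P^{\mathfrak{C};\,do(X_1 := x^{\triangle})}_{X_2} = P^{\mathfrak{C}}_{X_2} \text{ for all } x^{\triangle} \text{ with } p_1(x^{\triangle}) > 0 \\
&\overset{\neg\text{(ii)}}{\Longrightarrow}\; P^{\mathfrak{C};\,do(X_1 := x^{\triangle})}_{X_2} = P^{\mathfrak{C}}_{X_2} \text{ for all } x^{\triangle} \\
&\Longrightarrow\; \neg\text{(iii)}.
\end{aligned}$$

Here, the symbol “$\neg$” denotes the negation of a statement.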


C.5 Proof of Proposition 6.14 


Proof. Statement (i) follows directly from the Markov property of the
interventional SCM. The intervention removes the incoming edges into X, and if there is
no directed path from X to Y in the original graph, X and Y are then d-separated.
Statement (ii) can be proved by a counterexample (see, e.g., Example 6.34).


C.6 Proof of Proposition 6.36 


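Proof. “if”: Assume that causal minimality is not satisfied. Then, there is an $X_j$
and a $Y \in \mathbf{PA}^{\mathcal{G}}_j$, such that $P_X$ is also Markovian with respect to the graph obtained
when removing the edge $Y \to X_j$ from $\mathcal{G}$. This implies $X_j \perp\!\!\!\perp Y \mid \mathbf{PA}^{\mathcal{G}}_j \setminus \{Y\}$ by the
local Markov property.

“only if”: If $P_X$ has a density, the Markov condition is equivalent to the Markov
factorization [Lauritzen, 1996, Theorem 3.27]. Assume now that $Y \in \mathbf{PA}^{\mathcal{G}}_j$ and
$X_j \perp\!\!\!\perp Y \mid \mathbf{PA}^{\mathcal{G}}_j \setminus \{Y\}$, which implies $p(x_j \mid pa^{\mathcal{G}}_j) = p(x_j \mid pa^{\mathcal{G}}_{j,-Y})$ where $\mathbf{PA}^{\mathcal{G}}_{j,-Y}$ is
defined as $\mathbf{PA}^{\mathcal{G}}_{j,-Y} := \mathbf{PA}^{\mathcal{G}}_j \setminus \{Y\}$. Then, $p(x) = p(x_j \mid pa^{\mathcal{G}}_{j,-Y}) \prod_{k \neq j} p(x_k \mid pa^{\mathcal{G}}_k)$, which
implies that $P_X$ is Markovian with respect to $\mathcal{G}$ without $Y \to X_j$.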


C.7 Proof of Proposition 6.48 


Proof. We assume that both models satisfy causal minimality and come with
graphs $\mathcal{G}$ and $\mathcal{H}$. Intuitively, we can identify the children of a node $X$ since they
change after intervening on $X$. Some of the children, however, may not change
their distribution after an intervention due to two canceling paths, for example. We
thus introduce the following notation. Given a DAG $\mathcal{G}$, we call $X$ a youngest parent
of a node $Y$ and write $X \in \mathbf{YPA}^{\mathcal{G}}_Y$ if $X \in \mathbf{PA}^{\mathcal{G}}_Y$ and $X$ is not an ancestor of any
other parent of $Y$. A node $Y$ may have several youngest parents. The proof requires
two arguments:


(i) If $X \in \mathbf{YPA}^{\mathcal{G}}_Y$, then there is a total causal effect from $X$ to $Y$, meaning that
there are $x^{\triangle}$ and $x^{\square}$, such that $P^{do(X := x^{\triangle})}_Y \neq P^{do(X := x^{\square})}_Y$. This follows from
causal minimality.

(ii) If $Z \in \mathbf{AN}^{\mathcal{G}}_Y$, then there exist $X_1,\ldots,X_k$ such that $X_1 = Z$, $X_k = Y$, and
$X_i \in \mathbf{YPA}^{\mathcal{G}}_{X_{i+1}}$ for $i \in \{1,\ldots,k-1\}$.


Finally, we can combine these two statements and conclude that if $Z \in \mathbf{AN}^{\mathcal{G}}_Y$, then
there are $X_1,\ldots,X_k$ such that for $i \in \{1,\ldots,k-1\}$, $X_i$ has a total causal effect on
$X_{i+1}$, which implies that there must be a directed path from $X_i$ to $X_{i+1}$ also in
$\mathcal{H}$; see Proposition 6.13. But then $Z \in \mathbf{AN}^{\mathcal{H}}_Y$, which implies that both $\mathcal{G}$ and $\mathcal{H}$ have
the same ancestor relationships. Since both $\mathcal{G}$ and $\mathcal{H}$ satisfy causal minimality, this
implies that $\mathcal{G} = \mathcal{H}$ and therefore the two models are equivalent as causal graphical
models.


C.8 Proof of Proposition 6.49 


Proof. According to the proof of Proposition 6.3, we can write for the first SCM
$X = g(N)$. But since


$$g(n) = g^*(n) \quad \forall n \text{ with } p(n) > 0,$$


we clearly have that both SCMs induce the same observational distributions (and
intervention distributions with the same argument). Regarding counterfactuals, we
cover both the discrete and the continuous case by conditioning on $X \in A$ with
$P(X \in A) > 0$; see Definition 6.17. The new density over the noise variables
satisfies


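$$\begin{aligned}
p_{N \mid X \in A}(n_1,\ldots,n_d) &= \begin{cases} \dfrac{p(n_1,\ldots,n_d)}{P(X \in A)} & \text{if } g(n_1,\ldots,n_d) \in A \\ 0 & \text{else} \end{cases} \\
&= \begin{cases} \dfrac{p(n_1,\ldots,n_d)}{P(X \in A)} & \text{if } g^*(n_1,\ldots,n_d) \in A \\ 0 & \text{else} \end{cases} \\
&= \begin{cases} \dfrac{p(n_1,\ldots,n_d)}{P(g^*(N) \in A)} & \text{if } g^*(n_1,\ldots,n_d) \in A \\ 0 & \text{else} \end{cases} \\
&= p^*_{N \mid X \in A}(n_1,\ldots,n_d).
\end{aligned}$$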


We still have
$$g(n) = g^*(n) \quad \forall n \text{ with } p(n) > 0,$$


which implies that all counterfactual statements coincide.

C.9 Proof of Proposition 7.1


Proof. Let $N_1,\ldots,N_d$ be independent and uniformly distributed between 0 and 1.
We then define $X_j := f_j(X_{\mathbf{PA}_j}, N_j)$ with


$$f_j(pa_j, n_j) = F^{-1}_{X_j \mid \mathbf{PA}_j = pa_j}(n_j), \qquad \text{(C.4)}$$


where $F^{-1}_{X_j \mid \mathbf{PA}_j = pa_j}$ is the generalized inverse cumulative distribution function of
$X_j$ given $\mathbf{PA}_j = pa_j$. The generalized inverse cumulative distribution function of a
random variable $Y$ is defined as $F_Y^{-1}(a) := \inf\{y \in \mathbb{R} : F_Y(y) \ge a\}$. Equation (C.4)
guarantees that in the constructed SCM, the conditional $X_j \mid \mathbf{PA}_j = pa_j$ has the
correct distribution. The statement then follows from the Markov factorization,
Definition 6.21 (iii).
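The construction in (C.4) can be illustrated for discrete variables. The following is a minimal sketch, assuming a two-node graph $X_1 \to X_2$ with binary variables; the conditional probabilities, the function names, and the use of `bisect_left` to evaluate $\inf\{y : F(y) \ge a\}$ are choices made for this illustration only.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def generalized_inverse_cdf(values, probs):
    """Return the map a -> inf{y : F(y) >= a} for a discrete distribution.

    `values` must be sorted increasingly, `probs` are the matching probabilities.
    """
    cdf = list(accumulate(probs))
    def inverse(a):
        idx = bisect_left(cdf, a)     # first index with cdf[idx] >= a
        return values[min(idx, len(values) - 1)]  # guard against rounding in cdf
    return inverse

# Hypothetical conditionals: P(X1 = 1) = 0.3, P(X2 = 1 | X1 = 0) = 0.2,
# P(X2 = 1 | X1 = 1) = 0.9.
f1 = generalized_inverse_cdf([0, 1], [0.7, 0.3])
f2_given = {0: generalized_inverse_cdf([0, 1], [0.8, 0.2]),
            1: generalized_inverse_cdf([0, 1], [0.1, 0.9])}

def sample_scm(rng):
    """Draw (X1, X2) from the SCM X1 := f1(N1), X2 := f2(X1, N2)."""
    n1, n2 = rng.random(), rng.random()   # independent uniform noise variables
    x1 = f1(n1)
    x2 = f2_given[x1](n2)
    return x1, x2

rng = random.Random(0)
sample = [sample_scm(rng) for _ in range(1000)]
```

Each structural assignment is a deterministic function of the parents and a uniform noise variable, and the empirical frequencies of `sample` approximate the prescribed conditionals.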


C.10 Proof of Proposition 7.4 


Proof. Assume causal minimality is not satisfied. We can then find nodes $j$ and
$i \in \mathbf{PA}_j$ with $X_j = f_j(\mathbf{PA}_j \setminus \{i\}, X_i) + N_j$ that does not depend on $X_i$ if we condition
on all other parents $A := \mathbf{PA}_j \setminus \{i\}$, that is $X_j \perp\!\!\!\perp X_i \mid X_A$ (see Proposition 6.36). Here,
we denote $\mathbf{PA}_j \setminus \{X_i\}$ by $X_A$. For the function $f_j$, we will now show that $f_j(x_A, x_i) = c_{x_A}$
for $P_{X_A,X_i}$-almost all $(x_A, x_i)$. Indeed, assume without loss of generality that
$E[N_j] = 0$, then the mean of $X_j \mid \mathbf{PA}_j = (x_A, x_i)$ equals $f_j(x_A, x_i)$. Equation (2b)
from Dawid [1979] states that if $X_j \perp\!\!\!\perp X_i \mid X_A$, then the density of $X_j \mid X_A, X_i$ does
not depend on the argument of $X_i$. Therefore, also the conditional mean $f_j(x_A, x_i)$
does not depend on $x_i$. It follows that $f_j(x_A, x_i) = c_{x_A}$. The continuity of $f_j$ implies
that $f_j$ is constant in its last argument.

The converse statement follows from Proposition 6.36, too.


C.11 Proof of Proposition 8.1 


Proof. We use the Bellman optimality equation [e.g., Sutton and Barto, 2015,
Chapter 3.8]. For all $s^{\circ}$ and $s$ with $f(s^{\circ}) = f(s)$, we have

$$\begin{aligned}
Q^*(s,a) &= \sum_{s'} p(s' \mid s,a) \Bigl( E[R \mid s',a] + \max_{a'} Q^*(s',a') \Bigr) \\
&= \sum_{f'} \sum_{s' : f(s') = f'} p(s' \mid s,a) \Bigl( E[R \mid s',a] + \max_{a'} Q^*(s',a') \Bigr) \\
&= \sum_{f'} p(f' \mid s,a) \Bigl( E[R \mid f',a] + \max_{a'} Q^*(s',a') \Bigr) \\
&= \sum_{f'} p(f' \mid s^{\circ},a) \Bigl( E[R \mid f',a] + \max_{a'} Q^*(s',a') \Bigr) = Q^*(s^{\circ},a).
\end{aligned}$$

This concludes the proof.


C.12 Proof of Proposition 8.2 


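Proof. The first equation follows from the discussion in Section 8.2.1. The Markov
factorization property implies

$$p(x) = p(a \mid s)\, p(s \mid h)\, p(h)\, p(y \mid f,h)\, p(f \mid a);$$

see Figure 8.5. It now follows with $F \perp\!\!\!\perp S \mid A$ that

$$\begin{aligned}
\int y \, \frac{\tilde p(a \mid s)}{p(a \mid s)} \, p(x) \, dx &= \int y \, \tilde p(a \mid s)\, p(s \mid h)\, p(h)\, p(y \mid f,h)\, p(f \mid a,s) \, da \, df \, dh \, ds \, dy \\
&= \int y \, \tilde p(f,a \mid s)\, p(s \mid h)\, p(h)\, p(y \mid f,h) \, da \, df \, dh \, ds \, dy \\
&= \int y \, \frac{\tilde p(f \mid s)}{p(f \mid s)} \, p(s \mid h)\, p(h)\, p(y \mid f,h)\, p(f \mid s) \, df \, dh \, ds \, dy \\
&= \int y \, \frac{\tilde p(f \mid s)}{p(f \mid s)} \, p(s \mid h)\, p(h)\, p(y \mid f,h)\, p(f,a \mid s) \, da \, df \, dh \, ds \, dy \\
&= \int y \, \frac{\tilde p(f \mid s)}{p(f \mid s)} \, p(x) \, dx.
\end{aligned}$$

The last equality follows from $p(f,a \mid s) = p(f \mid a,s)\, p(a \mid s)$.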


C.13 Proof of Proposition 9.3 


Proof. To show (i), we start with the SCM $\mathfrak{C}$ over $\mathbf{X}$ and its entailed distribution
$P_{\mathbf{X}}$. We then consider the structural assignments for variables $O \in \mathbf{O}$ and repeatedly
plug in the assignments for the variables $X \in \mathbf{X} \setminus \mathbf{O}$ whenever these variables
appear on the right-hand side. This leads to a new SCM in which each structural
assignment for $O \in \mathbf{O}$ contains a multivariate error variable $N_O$. It is apparent that
this smaller SCM entails the same observational distribution $P_{\mathbf{O}}$ and the same
intervention distributions when intervening on any $O \in \mathbf{O}$. From causal sufficiency,
it follows that the new noise variables $(N_O)_{O \in \mathbf{O}}$ are jointly independent. As in the
case of one-dimensional noise variables (Proposition 6.31), this again implies that
the distribution $P_{\mathbf{O}}$ is Markovian with respect to the induced graph structure. The
statement now follows from the fact that this new SCM can be transformed to an
SCM with one-dimensional error variables that entails the same observational and
intervention distributions (exploiting the same construction as in Proposition 7.1).
For a more formal description of this procedure, as well as for more details on these
arguments, see Bongers et al. [2016].
Statement (ii) follows from Example 9.2.


C.14 Proof of Theorem 10.3 


Proof. If there is an arrow from $X^j_{past(t)}$ to $X^k_t$, the dependence (10.3) follows
immediately from faithfulness because two directly connected variables cannot be
d-separated. Now assume that there is no edge from $X^j$ to $X^k$. Then, $X^k_t$ is
d-separated from $X^j_{past(t)}$ given $X^{-j}_{past(t)}$: Any path leaving $X^k_t$ with an outgoing edge is
blocked because it will have a collider (and no node with time index larger than or
equal to $t$ is conditioned on); any path leaving $X^k_t$ with an incoming edge is blocked
because the next node is in the conditioning set $X^{-j}_{past(t)}$.


C.15 Proof of Theorem 10.4 


Proof. To prove (i), consider a full time graph containing no arrow from $X$ to $Y$.
Then, every path from $Y_t$ to $X_{past(t)}$ is blocked by $Y_{past(t)}$. Any path that starts with
an outgoing edge from $Y_t$ must contain a collider that is not in the conditioning
set (neither is any of its descendants); any path starting with an incoming edge is
blocked since the first node on this path is in $Y_{past(t)}$.

To prove (ii), assume $Y_t$ has parents from $X$, denoted by $\mathbf{PA}^X_{Y_t}$. Then (10.5) implies


$$Y_t \perp\!\!\!\perp \mathbf{PA}^X_{Y_t} \mid Y_{past(t)}. \qquad \text{(C.5)}$$

For any $X_s \in \mathbf{PA}^X_{Y_t}$, (C.5) implies by weak union (see Appendix A.1)


$$Y_t \perp\!\!\!\perp X_s \mid Y_{past(t)} \cup (\mathbf{PA}^X_{Y_t} \setminus \{X_s\}). \qquad \text{(C.6)}$$

Due to Peters et al. [2014, Lemma 38], minimality implies that $Y_t$ depends on
any parent $A$ of $Y_t$, given any set of non-descendants of $Y_t$ that includes the other
parents of $Y_t$ except $A$. Hence we have


$$Y_t \not\perp\!\!\!\perp X_s \mid Y_{past(t)} \cup (\mathbf{PA}^X_{Y_t} \setminus \{X_s\}),$$


in contradiction to (C.6).